Aneuploidy, Mutation, AFP, CA-125, CA19-9, CEA, HGF, OPN, Prolactin, TIMP-1) and on Aneuploidy only. Using Aneuploidy alone, 80% of the cancer samples are correctly classified, with a precision of 94%. This notebook refers to the original publication and to a follow-up in which a new sequencing method was applied.
import pandas as pd
import numpy as np
import warnings
import pickle
import time
import copy
import itertools
from joblib import dump, load
import shap
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
from sklearn.metrics import recall_score, precision_score, make_scorer
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn import tree, ensemble
import xgboost as xgb
from catboost import CatBoostClassifier
import lightgbm as lgb
from scipy.stats import normaltest, levene, bartlett
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 100)
np.set_printoptions(threshold=100)
# Load the dataset
data = pd.read_excel('Dataset S7.xlsx', skiprows=1)
print(data.shape)
print("Unique patient IDs: {}".format(len(data['Patient ID #'].unique())))
display(data.head())
display(data.info())
(1695, 14)
Unique patient IDs: 1695
| Patient ID # | Sample ID # | Tumor type | AJCC Stage | Aneuploidy | Mutation | AFP (pg/ml) | CA-125 (U/ml) | CA19-9 (U/ml) | CEA (pg/ml) | HGF (pg/ml) | OPN (pg/ml) | Prolactin (pg/ml) | TIMP-1 (pg/ml) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CRC 455 | CRC 455 PLS 1 | Colorectum | I | 0.176106 | 2.962820 | 1583.450 | 5.090 | 16.452 | 540.07 | 377.26 | 56516.58 | 11606.60 | 56428.71 |
| 1 | CRC 456 | CRC 456 PLS 1 | Colorectum | I | 0.621596 | 2.445405 | 715.308 | 7.270 | 40.910 | 5902.43 | 659.68 | 61001.39 | 14374.99 | 73940.49 |
| 2 | CRC 457 | CRC 457 PLS 1 | Colorectum | II | 0.591770 | 1.215758 | 4365.530 | 4.854 | 16.452 | 973.75 | 329.07 | 88896.24 | 38375.00 | 22797.28 |
| 3 | CRC 458 | CRC 458 PLS 1 | Colorectum | II | 0.562455 | 1.640793 | 715.308 | 5.390 | 16.452 | 2027.53 | 266.66 | 42549.61 | 12072.51 | 20441.19 |
| 4 | CRC 459 | CRC 459 PLS 1 | Colorectum | II | 0.052949 | 1.325771 | 801.300 | 4.854 | 16.452 | 614.49 | 370.88 | 24274.11 | 23718.17 | 56288.51 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1695 entries, 0 to 1694
Data columns (total 14 columns):
Patient ID #         1695 non-null object
Sample ID #          1695 non-null object
Tumor type           1695 non-null object
AJCC Stage           883 non-null object
Aneuploidy           1131 non-null float64
Mutation             1630 non-null float64
AFP (pg/ml)          1695 non-null float64
CA-125 (U/ml)        1695 non-null float64
CA19-9 (U/ml)        1695 non-null float64
CEA (pg/ml)          1695 non-null float64
HGF (pg/ml)          1695 non-null float64
OPN (pg/ml)          1695 non-null float64
Prolactin (pg/ml)    1695 non-null float64
TIMP-1 (pg/ml)       1695 non-null float64
dtypes: float64(10), object(4)
memory usage: 185.5+ KB
None
This time the number of Patient IDs (1695) is adequate and matches what is reported in the publication. We can also see that there are 883 patients with cancer, as indicated by the non-null values in the AJCC Stage column.
Furthermore, we can drop columns that are not needed (Sample ID # and Patient ID #) or that are too closely related to the target variable (AJCC Stage), and replace NaN values with 0 in the Aneuploidy and Mutation columns.
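As a sanity check, the cancer/healthy split can be read off the AJCC Stage column, since only cancer samples carry a stage. A minimal sketch on a toy stand-in frame (the column names follow the real dataset; the rows are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset: cancer samples carry an AJCC Stage,
# healthy samples have NaN in that column.
toy = pd.DataFrame({
    "Patient ID #": ["CRC 1", "CRC 2", "N 1", "N 2", "N 3"],
    "AJCC Stage":   ["I", "II", np.nan, np.nan, np.nan],
})

n_cancer = toy["AJCC Stage"].notnull().sum()   # samples with a stage -> cancer
n_healthy = toy["AJCC Stage"].isnull().sum()   # samples without -> healthy
print(n_cancer, n_healthy)  # 2 3
```

On the real data the same two lines yield the 883 cancer and 812 healthy samples.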
# Delete columns Sample ID and AJCC Stage and Patient ID
if "Sample ID #" in data.columns:
    data.drop(["Sample ID #", "AJCC Stage", "Patient ID #"], axis=1, inplace=True)
### Change column "Tumor type" to ordinal categorical ###
tumor_types = {'Colorectum': 1, 'Lung': 2, 'Breast': 3, 'Pancreas': 4,
'Ovary': 5, 'Esophagus': 6, 'Liver': 7, 'Stomach': 8,
'Normal': 9}
if 'Colorectum' in data['Tumor type'].unique():
    data.replace({'Tumor type': tumor_types}, inplace=True)
### inverse mapping for later use
inv_tumor_type_mapping = {v: k for k, v in tumor_types.items()}
# Lastly, clean up the column names
new_cols = ['Tumor type', 'Aneuploidy', 'Mutation', 'AFP', 'CA-125',
'CA19-9', 'CEA', 'HGF', 'OPN', 'Prolactin', 'TIMP-1']
data.columns = new_cols
data.describe([.25, .5, .75, .90])
| Tumor type | Aneuploidy | Mutation | AFP | CA-125 | CA19-9 | CEA | HGF | OPN | Prolactin | TIMP-1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1695.000000 | 1131.000000 | 1630.000000 | 1695.000000 | 1695.000000 | 1695.000000 | 1695.000000 | 1695.000000 | 1695.000000 | 1695.00000 | 1695.000000 |
| mean | 5.854867 | 0.380008 | 3.955169 | 6643.809436 | 21.624168 | 53.101788 | 4346.874935 | 318.400186 | 55464.116428 | 30426.35720 | 69727.203068 |
| std | 3.405522 | 0.383222 | 18.444197 | 50066.948263 | 154.670070 | 419.604271 | 23874.785867 | 481.549152 | 47854.633728 | 50014.47526 | 47238.133181 |
| min | 1.000000 | 0.000370 | 0.000000 | 706.158000 | 4.608000 | 14.214000 | 426.438000 | 158.334000 | 3218.166000 | 806.28000 | 976.550000 |
| 25% | 2.000000 | 0.028398 | 0.703514 | 822.144000 | 4.884000 | 16.320000 | 603.730000 | 163.995000 | 25575.805000 | 8412.78000 | 41127.780000 |
| 50% | 8.000000 | 0.195505 | 0.959890 | 929.640000 | 4.980000 | 16.482000 | 1035.150000 | 183.180000 | 40388.190000 | 13521.30000 | 58986.980000 |
| 75% | 9.000000 | 0.835059 | 1.317015 | 1845.855000 | 6.270000 | 18.380000 | 1886.430000 | 290.090000 | 67337.810000 | 26075.09000 | 82923.445000 |
| 90% | 9.000000 | 0.950337 | 3.728797 | 3836.556000 | 12.950000 | 38.104000 | 3700.642000 | 508.470000 | 117451.458000 | 73188.27000 | 118113.364000 |
| max | 9.000000 | 0.999718 | 333.234911 | 600608.892000 | 3329.740000 | 12491.472000 | 337245.426000 | 11432.980000 | 406443.400000 | 608432.38200 | 569512.690000 |
def display_missing_values(train_set, test_set):
    ''' Display the percent of missing values in both the train and test sets
    '''
    # Divide by the full length; using .count() would exclude the NaNs from the denominator
    train = (train_set.isnull().sum() / len(train_set) * 100).to_dict()
    test = (test_set.isnull().sum() / len(test_set) * 100).to_dict()
    missing_df = pd.DataFrame(data={'Missing in train set (%)': list(train.values()),
                                    'Missing in test set (%)': list(test.values())},
                              index=train.keys())
    return missing_df
# Plot missing values to get another sense of it
plt.figure(figsize=(10, 4))
# Plot missing values
sns.heatmap(data.isnull(), cbar=False)
plt.title("Missing values", size=18);
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1695 entries, 0 to 1694
Data columns (total 11 columns):
Tumor type    1695 non-null int64
Aneuploidy    1131 non-null float64
Mutation      1630 non-null float64
AFP           1695 non-null float64
CA-125        1695 non-null float64
CA19-9        1695 non-null float64
CEA           1695 non-null float64
HGF           1695 non-null float64
OPN           1695 non-null float64
Prolactin     1695 non-null float64
TIMP-1        1695 non-null float64
dtypes: float64(10), int64(1)
memory usage: 145.8 KB
None
Some algorithms have problems with missing values. As there are quite a few in the Aneuploidy and Mutation columns, they will be replaced with zeros.
# Replace NaN values with 0 (plain assignment avoids the chained-assignment pitfall of
# calling fillna(inplace=True) on a .loc slice)
data["Aneuploidy"] = data["Aneuploidy"].fillna(0)
data["Mutation"] = data["Mutation"].fillna(0)
# Display that there are no missing values left
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1695 entries, 0 to 1694
Data columns (total 11 columns):
Tumor type    1695 non-null int64
Aneuploidy    1695 non-null float64
Mutation      1695 non-null float64
AFP           1695 non-null float64
CA-125        1695 non-null float64
CA19-9        1695 non-null float64
CEA           1695 non-null float64
HGF           1695 non-null float64
OPN           1695 non-null float64
Prolactin     1695 non-null float64
TIMP-1        1695 non-null float64
dtypes: float64(10), int64(1)
memory usage: 145.8 KB
None
# Start by randomly shuffling the dataset
sh_data = data.sample(frac=1, random_state=56)
# Split into features and target variable
X = sh_data[new_cols[1:]]
Y = sh_data['Tumor type']
# Split into train and test sets
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2, random_state=155, stratify=Y)
# Print stats
print(f'Full data shape: {sh_data.shape}')
display(sh_data.head())
print(f'Train set shape: {trainX.shape}')
display(trainX.head())
print(f'Test set shape: {testX.shape}')
display(testX.head())
Full data shape: (1695, 11)
| Tumor type | Aneuploidy | Mutation | AFP | CA-125 | CA19-9 | CEA | HGF | OPN | Prolactin | TIMP-1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 395 | 1 | 0.640274 | 1.072957 | 833.736 | 4.890 | 16.464 | 788.16 | 163.776 | 31743.02 | 13036.78 | 61515.64 |
| 1573 | 9 | 0.000000 | 1.223815 | 841.524 | 5.010 | 21.290 | 1946.49 | 167.010 | 45661.52 | 12870.75 | 64887.38 |
| 263 | 3 | 0.400011 | 0.698252 | 929.640 | 4.896 | 16.422 | 3336.38 | 335.340 | 49477.59 | 22518.63 | 54411.19 |
| 924 | 9 | 0.000000 | 0.000000 | 1132.130 | 5.950 | 31.400 | 744.44 | 183.340 | 22467.43 | 31331.84 | 22634.00 |
| 289 | 2 | 0.920944 | 0.968599 | 929.640 | 46.050 | 16.422 | 2083.04 | 161.112 | 58583.33 | 5461.15 | 58134.12 |
Train set shape: (1356, 10)
| Aneuploidy | Mutation | AFP | CA-125 | CA19-9 | CEA | HGF | OPN | Prolactin | TIMP-1 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 947 | 0.000000 | 0.424321 | 1198.170 | 4.848 | 42.600 | 1261.97 | 291.210 | 37876.73 | 10129.65 | 105320.72 |
| 1157 | 0.011644 | 0.981975 | 2701.010 | 4.962 | 16.602 | 1918.02 | 165.504 | 19256.44 | 9392.72 | 26216.75 |
| 344 | 0.926522 | 0.437984 | 3292.880 | 4.944 | 16.686 | 4261.12 | 168.576 | 29324.15 | 200229.11 | 55442.36 |
| 1378 | 0.000000 | 1.132631 | 851.052 | 4.962 | 16.602 | 853.27 | 165.504 | 14937.94 | 6430.83 | 55557.46 |
| 1568 | 0.007435 | 0.890999 | 867.440 | 4.902 | 15.744 | 1587.54 | 201.470 | 79312.06 | 38229.33 | 46370.67 |
Test set shape: (339, 10)
| Aneuploidy | Mutation | AFP | CA-125 | CA19-9 | CEA | HGF | OPN | Prolactin | TIMP-1 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 327 | 0.023964 | 0.688304 | 807.054 | 4.926 | 16.38 | 786.900 | 162.372 | 8494.63 | 8320.69 | 33798.58 |
| 1430 | 0.000000 | 0.000000 | 789.042 | 4.998 | 21.20 | 467.778 | 165.036 | 32003.82 | 11497.78 | 114827.70 |
| 1072 | 0.000000 | 0.947029 | 802.884 | 5.770 | 16.47 | 950.980 | 163.752 | 22480.79 | 19227.07 | 29727.22 |
| 1088 | 0.000000 | 0.684181 | 802.884 | 4.980 | 16.47 | 448.872 | 163.752 | 17618.01 | 35507.00 | 35600.02 |
| 1628 | 0.657563 | 1.193011 | 834.300 | 7.560 | 57.32 | 1207.820 | 388.450 | 74679.41 | 17924.22 | 150908.39 |
Start by creating custom transformers that perform the desired transformation. From the supplementary material accompanying the publication:
To account for variations in the lower limits of detection across different experiments, we found the 90th percentile feature value in the healthy training samples. We then found any feature value below that threshold and set all values to the 90th percentile threshold. This transformation was done for all training and testing samples. This procedure was done for aneuploidy scores, somatic mutation scores, and protein concentrations.
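On toy numbers, the procedure quoted above amounts to computing the healthy 90th percentile and clipping every sample from below at that value (the data here is made up for illustration):

```python
import pandas as pd

# One toy biomarker; label 9 marks the healthy samples (as in the dataset)
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 50.0, 0.5])
labels = pd.Series([9, 9, 9, 9, 9, 1, 1])

# 90th percentile of the healthy samples only (linear interpolation)
threshold = values[labels == 9].quantile(0.90)

# Clip every sample, healthy and cancer alike, from below at that threshold
clipped = values.clip(lower=threshold)
print(round(threshold, 6), clipped.round(6).tolist())
# 4.6 [4.6, 4.6, 4.6, 4.6, 5.0, 50.0, 4.6]
```

Everything below the threshold collapses onto it, which is exactly the spike that appears in the transformed histograms later on.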
class FeatureSelector(BaseEstimator, TransformerMixin):
    ''' Custom transformer that extracts the columns passed as arguments to its constructor
    '''
    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        # Nothing to learn; return self
        return self

    def transform(self, X, y=None):
        # Return only the selected columns
        return X.loc[:, self.feature_names]
class PercentileTransformer(BaseEstimator, TransformerMixin):
    ''' Custom transformer that replaces all values that are lower than the
    healthy 90th percentile with the 90th percentile value.
    '''
    def __init__(self, percentile=.90, healthy_class=9):
        ''' percentile is the percentile to base the transformation on,
        while healthy_class is the class label of the healthy samples.'''
        self.percentile = percentile
        self.healthy_class = healthy_class

    def fit(self, X, y):
        # Check if X is a DataFrame; if not, convert it
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        # Fill NaN values with zero
        X = X.fillna(0.0)
        # Calculate the per-column thresholds on the healthy samples only
        thresholds = X.loc[y == self.healthy_class, :].quantile(q=self.percentile,
                                                                interpolation='linear').to_dict()
        # Store for later use
        self.thresholds = thresholds
        return self

    def transform(self, X, y=None):
        # If X is not a DataFrame, convert it
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        # Create a copy and fill NaN values with zero
        X_ = X.copy(deep=True)
        X_ = X_.fillna(0.0)
        # Replace values lower than the (90th) percentile threshold
        for p in self.thresholds:
            X_[p] = X_[p].apply(lambda x: self.thresholds[p] if x < self.thresholds[p] else x)
        return X_
# Create transformer, fit to train set and transform both the train and test set
pt = PercentileTransformer()
pt.fit(trainX, trainY)
trainX = pt.transform(trainX)
testX = pt.transform(testX)
Visualise the data using boxplots, histograms and correlation plots to get a better understanding of it.
Start by using histograms to compare each feature's distribution before and after winsorizing based on the healthy samples: the untransformed distribution is shown on the left and the transformed train/test distributions on the right.
plt.figure(figsize=(25, 180))
counter = 1
for column in sh_data.select_dtypes(include=['float']).columns:
    # Distribution before transformation (full dataset)
    plt.subplot(40, 2, counter)
    sns.distplot(sh_data[column], bins=60)
    plt.title(f'{column} Train & test set distribution before transformation', size=14)
    # Train and test set distributions after transformation
    plt.subplot(40, 2, counter+1)
    sns.distplot(trainX[column], bins=60, label='Train')
    sns.distplot(testX[column], bins=60, label='Test')
    plt.title(f'{column} Train & Test set distribution after transformation', size=14)
    plt.legend()
    counter += 2
plt.subplots_adjust(hspace=.35, wspace=.13)
As displayed above, the distributions differ considerably before and after winsorizing. The pronounced spikes in the transformed distributions are obvious but also reasonable: all values below the healthy 90th percentile have been collapsed onto that threshold.
Continue by plotting the total count of each Tumor type, including the healthy class (Normal).
plt.figure(figsize=(16,5))
plt.subplot(1, 2, 1)
sns.countplot(y=trainY)
plt.title("Tumor Counts on Train set", size=15)
plt.xlabel("Count")
plt.grid()
plt.subplot(1, 2, 2)
sns.countplot(y=testY)
plt.title("Tumor Counts on Test set", size=15)
plt.xlabel("Count")
plt.grid();
Plot the target variable Tumor type against each feature using boxplots; the histograms further below aid the understanding of each feature's distribution.
plt.figure(figsize=(24,100))
train_plot = pd.merge(trainY, trainX, left_index=True, right_index=True)
test_plot = pd.merge(testY, testX, left_index=True, right_index=True)
counter = 1
for column in train_plot.select_dtypes(include=['float']).columns:
    plt.subplot(20, 4, counter)
    sns.boxplot(x="Tumor type", y=column, data=train_plot)
    plt.title(column + " -- train set", size=14)
    plt.subplot(20, 4, counter+1)
    sns.boxplot(x="Tumor type", y=column, data=test_plot)
    plt.title(column + " -- test set", size=14)
    counter += 2
plt.subplots_adjust(hspace=0.3, wspace=0.25)
del test_plot
The following histograms show the distribution of each variable, together with the test statistic and p-value from D'Agostino's normality test. None of the variables are normally distributed.
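scipy's `normaltest` implements D'Agostino and Pearson's test, which combines skew and kurtosis into a single k² statistic (not a t-statistic). A quick standalone illustration on synthetic data (sample size and seed are arbitrary):

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)
gaussian = rng.normal(size=500)       # normality should not be rejected
skewed = rng.exponential(size=500)    # heavily right-skewed, should be rejected

k2_g, p_g = normaltest(gaussian)
k2_s, p_s = normaltest(skewed)

# A small p-value means the null hypothesis of normality is rejected
print(p_s < 0.05)  # True for the exponential sample
```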
plt.figure(figsize=(24,130))
counter = 1
# Plot histograms of the transformed data
for column in trainX.select_dtypes(include=['float']).columns:
    #### Plot the train set ####
    to_plot1 = trainX[column]
    plt.subplot(20, 4, counter)
    sns.distplot(to_plot1, bins=30)
    # Vertical lines at the mean and at one standard deviation on either side
    plt.axvline(to_plot1.mean(), color='r', linestyle='solid', linewidth=1)
    plt.axvline(to_plot1.mean() + to_plot1.std(), color='r', linestyle='dashed', linewidth=1)
    plt.axvline(to_plot1.mean() - to_plot1.std(), color='r', linestyle='dashed', linewidth=1)
    # D'Agostino's k^2 test for normality
    k2, p = normaltest(trainX.loc[:, column])
    plt.title(column + "\n\nk2-stat={0:.2}, p-val={1:.2}".format(k2, p), size=14)
    #### Plot the test set ####
    to_plot2 = testX[column]
    plt.subplot(20, 4, counter+1)
    sns.distplot(to_plot2, bins=30)
    plt.axvline(to_plot2.mean(), color='r', linestyle='solid', linewidth=1)
    plt.axvline(to_plot2.mean() + to_plot2.std(), color='r', linestyle='dashed', linewidth=1)
    plt.axvline(to_plot2.mean() - to_plot2.std(), color='r', linestyle='dashed', linewidth=1)
    k2, p = normaltest(testX.loc[:, column])
    plt.title(column + "\n\nk2-stat={0:.2}, p-val={1:.2}".format(k2, p), size=14)
    counter += 2
plt.subplots_adjust(hspace=0.5, wspace=0.35)
Plot correlation matrix to display correlation between the target variable and features as well as among features.
plt.figure(figsize=(10,9))
corr_mat = train_plot.corr()
sns.heatmap(corr_mat, square=True, annot=True, linewidths=.5)
plt.title('Correlation Matrix\n', size=20)
plt.xticks(size=12, rotation=45, horizontalalignment="right")
plt.yticks(size=12);
There is only minor collinearity between the features, so there is no need to drop any of them.
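The same conclusion can be checked programmatically by flagging any feature pair whose absolute Pearson correlation exceeds a chosen cutoff (0.8 here; the data below is synthetic, just to show the pattern):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"a": rng.normal(size=200)})
df["b"] = df["a"] * 0.95 + rng.normal(scale=0.1, size=200)  # nearly a copy of 'a'
df["c"] = rng.normal(size=200)                              # independent noise

corr = df.corr().abs()
# Keep the upper triangle so every pair is inspected exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
collinear = [(r, c) for r in upper.index for c in upper.columns
             if upper.loc[r, c] > 0.8]
print(collinear)  # [('a', 'b')]
```

Run on `train_plot`, this list would be empty at a 0.8 cutoff, matching the heatmap.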
display(train_plot.head())
print(train_plot.shape)
| Tumor type | Aneuploidy | Mutation | AFP | CA-125 | CA19-9 | CEA | HGF | OPN | Prolactin | TIMP-1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 947 | 9 | 0.030115 | 1.152125 | 2913.149 | 7.16 | 42.600 | 2066.135 | 291.210 | 53695.137 | 21200.991 | 105320.720 |
| 1157 | 9 | 0.030115 | 1.152125 | 2913.149 | 7.16 | 22.862 | 2066.135 | 266.529 | 53695.137 | 21200.991 | 83402.143 |
| 344 | 2 | 0.926522 | 1.152125 | 3292.880 | 7.16 | 22.862 | 4261.120 | 266.529 | 53695.137 | 200229.110 | 83402.143 |
| 1378 | 9 | 0.030115 | 1.152125 | 2913.149 | 7.16 | 22.862 | 2066.135 | 266.529 | 53695.137 | 21200.991 | 83402.143 |
| 1568 | 9 | 0.030115 | 1.152125 | 2913.149 | 7.16 | 22.862 | 2066.135 | 266.529 | 79312.060 | 38229.330 | 83402.143 |
(1356, 11)
def plot_feature_importance(estimator, feature_names, figsize=(15,6)):
    """ Function for plotting feature importance"""
    feat_imp = pd.Series(estimator.feature_importances_, feature_names).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances', figsize=figsize)
    plt.ylabel('Feature Importance Score')
    plt.xticks(rotation=45, horizontalalignment="right")
    plt.grid();
def plot_confusion_matrix(testY, predicted, target_names=None, title='Confusion matrix',
                          cmap=None, normalize=False, figsize=(8,6)):
    """
    Arguments
    ---------
    testY:        true values on the test set
    predicted:    predicted values on the test set
    target_names: classification classes such as [0, 1, 2],
                  or the class names, for example: ['high', 'medium', 'low']
    title:        the text to display at the top of the matrix
    cmap:         the colormap of the values displayed, from matplotlib.pyplot.cm,
                  see http://matplotlib.org/examples/color/colormaps_reference.html
                  e.g. plt.get_cmap('jet') or plt.cm.Blues
    normalize:    if False, plot the raw counts; if True, plot the proportions
    figsize:      size of figure, specified as (x, y), for example (7, 5)

    Citation
    ---------
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

    Setting Seaborn plotting options
    ---------
    https://seaborn.pydata.org/tutorial/aesthetics.html#overriding-elements-of-the-seaborn-styles
    """
    cm = confusion_matrix(testY, predicted)
    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy
    sns.set_style("darkgrid", {"axes.grid": False})  # Remove the grid
    plt.figure(figsize=figsize)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, size=16)
    plt.colorbar()
    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=45, horizontalalignment="right")
        plt.yticks(tick_marks, target_names)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.4f}".format(cm[i, j]),
                     horizontalalignment="center", verticalalignment="center",
                     color="black" if cm[i, j] > thresh else "white")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center", verticalalignment="center",
                     color="black" if cm[i, j] > thresh else "white")
    plt.tight_layout()
    plt.ylabel('True label', size=14)
    plt.xlabel('Predicted label', size=14)
    plt.show()
# Specify a cancer types list for later use
cancers = ['Colorectum', 'Lung', 'Breast', 'Pancreas', 'Ovary',
'Esophagus', 'Liver', 'Stomach', 'Healthy', 'All']
def specificity_score(true, predicted, labels=None, pos_label=1,
                      average='weighted', sample_weight=None, zero_division="warn"):
    """Specificity scorer function for cross-validation.
    Assumes the healthy class (label 9) lands at positional index 8 of the confusion matrix."""
    # Create confusion matrix
    conf_matrix = pd.DataFrame(confusion_matrix(true, predicted))
    new_ind = conf_matrix.index.append(pd.Index(["All"]))
    # Add a summary column and row
    conf_matrix["All"] = conf_matrix.sum(axis=1)
    indAll = pd.DataFrame(conf_matrix.sum(axis=0)).T
    conf_matrix = pd.concat([conf_matrix, indAll], axis=0)
    conf_matrix.index = new_ind
    # Specificity = fraction of healthy samples classified as healthy
    specificity = conf_matrix.loc[8, 8] / conf_matrix.loc[8, "All"]
    return specificity
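Specificity here is simply the recall of the healthy class read off the confusion matrix; `conf_matrix.loc[8, 8]` works because label 9 is the last of the nine sorted labels. The same computation on a small three-class toy problem, with label 3 playing the healthy role:

```python
from sklearn.metrics import confusion_matrix

true      = [1, 1, 2, 3, 3, 3, 3]
predicted = [1, 2, 2, 3, 3, 1, 3]

cm = confusion_matrix(true, predicted)   # rows = true labels, columns = predicted
healthy_idx = 2                          # position of label 3 among the sorted labels
specificity = cm[healthy_idx, healthy_idx] / cm[healthy_idx].sum()
print(specificity)  # 3 of the 4 healthy samples are correct -> 0.75
```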
def cv_score_summary(cv_scores):
    """ Function for returning a summary over CV scores with standard deviations"""
    # Create a summary over cross-validation scores
    cv_scores_df = pd.DataFrame(columns=["Scores", "Std"],
                                index=['Specificity (med)', 'Sensitivity (med)',
                                       'Sensitivity weighted (med)', 'AUC (med)',
                                       'Specificity (mean)', 'Sensitivity (mean)',
                                       'Sensitivity weighted (mean)', 'AUC (mean)'])
    # Median scores
    specificity_median = round(np.median(cv_scores["test_specificity"]), 4)
    sensitivity_median = round(np.median(cv_scores["test_sensitivity"]), 4)
    sensitivity_weigh_med = round(np.median(cv_scores['test_sensitivity_w']), 4)
    auc_median = round(np.median(cv_scores["test_roc_auc"]), 4)
    # Mean scores
    specificity_mean = round(np.mean(cv_scores["test_specificity"]), 4)
    sensitivity_mean = round(np.mean(cv_scores["test_sensitivity"]), 4)
    sensitivity_weigh_mean = round(np.mean(cv_scores['test_sensitivity_w']), 4)
    auc_mean = round(np.mean(cv_scores["test_roc_auc"]), 4)
    # Add scores to dataframe
    cv_scores_df["Scores"] = [specificity_median, sensitivity_median, sensitivity_weigh_med, auc_median,
                              specificity_mean, sensitivity_mean, sensitivity_weigh_mean, auc_mean]
    # Standard deviations (the weighted sensitivity gets its own std)
    specificity_std = round(np.std(cv_scores["test_specificity"]), 4)
    sensitivity_std = round(np.std(cv_scores["test_sensitivity"]), 4)
    sensitivity_weigh_std = round(np.std(cv_scores['test_sensitivity_w']), 4)
    auc_std = round(np.std(cv_scores["test_roc_auc"]), 4)
    # Add std to dataframe
    cv_scores_df["Std"] = [specificity_std, sensitivity_std, sensitivity_weigh_std, auc_std] * 2
    return cv_scores_df
def cancer_type_stats(true, predicted):
    """Compute Sensitivity and Specificity for each of the cancer types
    (healthy included).
    Returns a DataFrame with sensitivities and a floating point number with the specificity.
    """
    # Create confusion matrix
    conf_matrix = pd.DataFrame(confusion_matrix(true, predicted))
    new_ind = conf_matrix.index.append(pd.Index(["All"]))
    # Add a summary column and row
    conf_matrix["All"] = conf_matrix.sum(axis=1)
    indAll = pd.DataFrame(conf_matrix.sum(axis=0)).T
    conf_matrix = pd.concat([conf_matrix, indAll], axis=0)
    conf_matrix.index = new_ind
    # Calculate sensitivities (per-class recall)
    sensitivities = {}
    for i in conf_matrix.columns:
        sensitivities["{}".format(i)] = conf_matrix.loc[i, i] / conf_matrix.loc[i, "All"]
    # Calculate specificity (healthy class at positional index 8)
    specificity = conf_matrix.loc[8, 8] / conf_matrix.loc[8, "All"]
    # Specify a list with the various cancers and create a dataframe
    cancers = ['Colorectum', 'Lung', 'Breast', 'Pancreas', 'Ovary',
               'Esophagus', 'Liver', 'Stomach', 'Healthy', 'All']
    stats = pd.DataFrame(sensitivities, index=range(len(conf_matrix.columns)))
    stats.columns = cancers[:len(conf_matrix.columns)]
    stats = stats.iloc[[0], :-2]
    # Calculate confidence intervals for the sensitivities
    counts = cancer_count(true).iloc[:, :-1]
    empty = pd.DataFrame(np.zeros([1, 8]), columns=counts.columns, index=["1"])
    more_stats = pd.concat([stats, empty, counts])
    more_stats.rename(index={0: "Sensitivities", "1": "Conf_Int"}, inplace=True)
    z = 1.96  # 95% confidence interval
    n = len(predicted)
    for i in range(len(more_stats.columns)):
        p = more_stats.iloc[0, i]
        more_stats.iloc[1, i] = z * np.sqrt((p * (1 - p)) / n)
    del counts, empty, stats  # delete what's not needed anymore
    return more_stats, specificity
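The confidence intervals above use the normal-approximation (Wald) interval for a proportion, with half-width z·sqrt(p(1−p)/n). Note that n here is the total number of predictions; using a per-class n would give wider, per-type intervals. A worked example (the numbers are illustrative):

```python
import numpy as np

def wald_halfwidth(p, n, z=1.96):
    """Half-width of the normal-approximation 95% CI for a proportion."""
    return z * np.sqrt(p * (1 - p) / n)

# e.g. a sensitivity of 0.80 estimated over 339 predictions
hw = wald_halfwidth(0.80, 339)
print(round(hw, 4))  # 0.0426
```

So an estimated sensitivity of 0.80 would be reported as roughly 0.80 ± 0.043.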
def plot_sensitivities(test_predicted, Ytest, title="Sensitivity per Cancer Type"):
    """ Plot the per-cancer-type sensitivities with confidence intervals.
    test_predicted holds the predicted labels and Ytest the true labels.
    """
    # Create and store some statistics on the model
    stats = cancer_type_stats(Ytest, test_predicted)[0]
    # Display the resulting sensitivities
    plt.figure(figsize=(8, 5))
    sns.barplot(data=stats.head(1),
                yerr=stats.iloc[1, :].values,
                error_kw={"capsize": 6, "capthick": 1})
    plt.title(title, size=18)
    plt.xticks(size=12)
    plt.yticks(size=12)
    plt.grid()
    plt.ylabel("Sensitivity", size=16)
    plt.show()
def cancer_count(true_values):
    """Display the count for each cancer type in the given labels"""
    Ccounts = pd.DataFrame(pd.Series(true_values).value_counts()).sort_index(inplace=False).T
    Ccounts.columns = Ccounts.columns.map(inv_tumor_type_mapping)
    return Ccounts
The pipeline includes all the steps needed to run 10-fold cross-validation: it takes the cleaned data and performs the data transformation, model tuning and model evaluation steps. For visualisation purposes, the transformation steps were shown earlier in the notebook, but in order to run unbiased cross-validation those steps have to be executed within each fold of the cross-validation.
To achieve this, a custom transformer that replaces all protein, aneuploidy and mutation values lower than the healthy 90th percentile with that percentile value was created earlier.
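A minimal sketch of the idea on synthetic data (the transformer and feature names here are stripped-down stand-ins, not the notebook's actual setup): wrapping the transformer and the estimator in one Pipeline guarantees that the thresholds are re-fit on each fold's training split only, so no information from the validation fold leaks into the transformation.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

class MinimalPercentile(BaseEstimator, TransformerMixin):
    """Tiny threshold transformer, refit inside every CV fold by the Pipeline."""
    def __init__(self, percentile=0.90, healthy_class=1):
        self.percentile = percentile
        self.healthy_class = healthy_class

    def fit(self, X, y):
        X = pd.DataFrame(X)
        # Per-column thresholds computed on the healthy training samples only
        self.thresholds_ = X[np.asarray(y) == self.healthy_class].quantile(self.percentile)
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        for col, t in self.thresholds_.items():
            X[col] = X[col].clip(lower=t)   # clip from below at the threshold
        return X

rng = np.random.default_rng(3)
X = pd.DataFrame({"m": rng.normal(size=100)})
y = np.where(X["m"] > 0.3, 0, 1)   # crude labels; class 1 plays the healthy role

pipe = Pipeline([("pt", MinimalPercentile()), ("clf", LogisticRegression())])
scores = cross_val_score(pipe, X, y, cv=5)   # transformer fit per fold, no leakage
print(len(scores))  # 5
```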
def crossVal(gs_object, Xvalues, Yvalues, nestedCV=False, cv_folds=10, n_jobs=-1, verbose=3):
    '''
    Cross validate the model with 10 stratified folds on Specificity, Sensitivity, Accuracy and AUC.
    gs_object is a GridSearchCV object or Pipeline. Set nestedCV=True to perform nested cross-validation.
    '''
    # If gs_object is a Pipeline or nestedCV is True, run cross-validation directly on the gs_object
    if nestedCV or isinstance(gs_object, Pipeline):
        to_evaluate = gs_object
    else:
        to_evaluate = gs_object.best_estimator_
    # Evaluate model
    cv_scores = cross_validate(to_evaluate, Xvalues, Yvalues, cv=cv_folds,
                               scoring={'specificity': make_scorer(specificity_score,
                                                                   average='weighted'),
                                        'sensitivity': 'recall_macro',
                                        'sensitivity_w': 'recall_weighted',
                                        'accuracy': 'accuracy',
                                        'roc_auc': 'roc_auc_ovo',
                                        'roc_auc_w': 'roc_auc_ovo_weighted'},
                               n_jobs=n_jobs, return_train_score=True, verbose=verbose)
    # Pause for a second for cleaner printouts
    time.sleep(1)
    # Print statistics
    print('\nModel report')
    if not isinstance(gs_object, Pipeline):
        print(f'Best parameters: {gs_object.best_params_}')
        print(f'Best score: {gs_object.best_score_}')
    print('\nCross Validated scores')
    print('Specificity (test): {:.4f}'.format(np.mean(cv_scores['test_specificity'])))
    print('Sensitivity weighted (test): {:.4f}'.format(np.mean(cv_scores['test_sensitivity_w'])))
    print('Sensitivity (test): {:.4f}'.format(np.mean(cv_scores['test_sensitivity'])))
    print('AUC (train): {:.2f}'.format(np.mean(cv_scores['train_roc_auc'])))
    print('AUC (test): {:.4f}\n'.format(np.mean(cv_scores['test_roc_auc'])))
    return cv_scores
Select the numerical columns to feed the numerical pipeline. Of course, in this case there are only numerical columns to feed the pipeline with.
# Select numerical features
numerical_features = list(sh_data.columns[1:])
# Define the steps in the numerical pipeline
numerical_pipeline = Pipeline(steps=[('numerical_selector', FeatureSelector(numerical_features)),
('PercentileTransformer', PercentileTransformer(percentile=0.90,
healthy_class=9)),
('StandardScaler', StandardScaler())])
%%time
# Select predictors and target variable
X = sh_data[numerical_features]
Y = sh_data['Tumor type']
# Split into train and test sets
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2, random_state=89, stratify=Y)
# Create classifiers
logReg = LogisticRegression(max_iter=1000)
knn = KNeighborsClassifier()
svc = SVC(probability=True)
rf = ensemble.RandomForestClassifier()
gb = ensemble.GradientBoostingClassifier(learning_rate=0.1)
ct = CatBoostClassifier(learning_rate=.1,
eval_metric='MultiClass',
bootstrap_type='Bernoulli',
silent=True)
lgbm = lgb.LGBMClassifier(learning_rate=0.1,
objective='multiclass')
xgboost = xgb.XGBClassifier(learning_rate=0.1,
max_depth=4,
min_child_weight=1,
gamma=0,
subsample=.8,
colsample_bytree=.8,
scale_pos_weight=1,
booster='gbtree',
eval_metric='merror',
objective='multi:softprob',
seed=29)
# Specify classifier names and add them in a list
names = ['LogisticRegression', 'KNN', 'SVC', 'RandomForest',
'GradientBoosting', 'CatBoost', 'LightGBM', 'XGBoost']
classifiers = [logReg, knn, svc, rf, gb, ct, lgbm, xgboost]
# Specify hyper parameters to tune for each classifier
parameters = [{'clf__C': [0.1, 1, 10, 50, 100]},
{'clf__n_neighbors': [3, 4, 5, 6, 7, 8, 9, 10, 11],
'clf__weights': ['uniform', 'distance'],
'clf__leaf_size': [2, 3, 4, 5, 6, 8, 10, 20],
'clf__p': [1, 2, 3]},
{'clf__C': [1, 10, 50, 100],
'clf__kernel': ['linear', 'rbf']},
{'clf__max_depth': [4, 5, 6],
'clf__n_estimators': [300, 400, 500, 600],
'clf__max_samples': [0.5, 0.7, 0.9, 1],
'clf__max_features': [0.25, 0.5, 0.75, 1]},
{'clf__n_estimators': [400, 500, 600],
'clf__max_depth': [3, 4, 5]},
{'clf__max_depth': [3, 4, 5],
'clf__n_estimators': [300, 400, 500, 600]},
{'clf__max_depth': [3, 4, 5],
'clf__n_estimators': [400, 500, 600],
'clf__num_leaves': [8, 16, 32, 64]},
{'clf__n_estimators': [200, 300, 400, 500],
'clf__max_depth': [3, 4, 5],
'clf__colsample_bytree': [.5, .75, 1.]}]
# Create dictionaries to store the results
crossVal_scores = {} ; best_models = {}; predictions = {}
# Train and evaluate a number of estimators in a pipeline
for name, classifier, params in zip(names, classifiers, parameters):
    print('\n\n============================================================================')
    print(f'================================== {name} ==================================')
    print('============================================================================')
    # Create a pipeline with the 10 features and an estimator
    clf_pipeline = Pipeline(steps=[('numerical_pipeline', numerical_pipeline),
                                   ('clf', classifier)])
    # Grid search
    gs_clf = GridSearchCV(clf_pipeline, params,
                          scoring={'recall': 'recall_weighted'},
                          refit='recall', cv=10, n_jobs=4, verbose=3)
    gs_clf.fit(trainX, trainY)
    # Cross validate the tuned model
    cv_scores = crossVal(gs_clf, X, Y, nestedCV=False, cv_folds=10)
    # Pause for cleaner printouts
    time.sleep(1)
    # Select the best model and make predictions on the entire dataset using the pipeline
    model = gs_clf.best_estimator_['clf']
    clf_pipeline = Pipeline(steps=[('numerical_pipeline', numerical_pipeline),
                                   ('model', model)],
                            verbose=1)
    preds = cross_val_predict(clf_pipeline, X, Y, cv=10, verbose=1, n_jobs=4)
    # Save results to dictionaries
    crossVal_scores[name] = cv_scores
    best_models[name] = model
    predictions[name] = preds
============================================================================
================================== LogisticRegression ==================================
============================================================================
Fitting 10 folds for each of 5 candidates, totalling 50 fits
Model report
Best parameters: {'clf__C': 50}
Best score: 0.6674455337690632
Cross Validated scores
Sensitivity weighted (test): 0.6690
Sensitivity (test): 0.3574
AUC (train): 0.83
AUC (test): 0.7935
============================================================================
================================== KNN ==================================
============================================================================
Fitting 10 folds for each of 432 candidates, totalling 4320 fits
Model report
Best parameters: {'clf__leaf_size': 2, 'clf__n_neighbors': 8, 'clf__p': 1, 'clf__weights': 'uniform'}
Best score: 0.6556644880174292
Cross Validated scores
Sensitivity weighted (test): 0.6496
Sensitivity (test): 0.3049
AUC (train): 0.91
AUC (test): 0.7162
============================================================================
================================== SVC ==================================
============================================================================
Fitting 10 folds for each of 8 candidates, totalling 80 fits
Model report
Best parameters: {'clf__C': 50, 'clf__kernel': 'linear'}
Best score: 0.6792429193899783
Cross Validated scores
Sensitivity weighted (test): 0.6779
Sensitivity (test): 0.3535
AUC (train): 0.79
AUC (test): 0.7594
============================================================================
================================== RandomForest ==================================
============================================================================
Fitting 10 folds for each of 192 candidates, totalling 1920 fits
Model report
Best parameters: {'clf__max_depth': 6, 'clf__max_features': 0.5, 'clf__max_samples': 0.9, 'clf__n_estimators': 500}
Best score: 0.7234912854030501
Cross Validated scores
Sensitivity weighted (test): 0.7257
Sensitivity (test): 0.3820
AUC (train): 0.96
AUC (test): 0.8442
============================================================================
================================== GradientBoosting ==================================
============================================================================
Fitting 10 folds for each of 9 candidates, totalling 90 fits
Model report
Best parameters: {'clf__max_depth': 3, 'clf__n_estimators': 600}
Best score: 0.7691830065359476
Cross Validated scores
Sensitivity weighted (test): 0.7717
Sensitivity (test): 0.4901
AUC (train): 1.00
AUC (test): 0.8678
============================================================================
================================== CatBoost ==================================
============================================================================
Fitting 10 folds for each of 12 candidates, totalling 120 fits
Model report
Best parameters: {'clf__max_depth': 3, 'clf__n_estimators': 600}
Best score: 0.7736002178649237
Cross Validated scores
Sensitivity weighted (test): 0.7806
Sensitivity (test): 0.5274
AUC (train): 0.99
AUC (test): 0.8835
============================================================================
================================== LightGBM ==================================
============================================================================
Fitting 10 folds for each of 36 candidates, totalling 360 fits
Model report
Best parameters: {'clf__max_depth': 3, 'clf__n_estimators': 400, 'clf__num_leaves': 8}
Best score: 0.7736165577342048
Cross Validated scores
Sensitivity weighted (test): 0.7694
Sensitivity (test): 0.4897
AUC (train): 1.00
AUC (test): 0.8707
============================================================================
================================== XGBoost ==================================
============================================================================
Fitting 10 folds for each of 36 candidates, totalling 360 fits
Model report
Best parameters: {'clf__colsample_bytree': 0.5, 'clf__max_depth': 4, 'clf__n_estimators': 200}
Best score: 0.781726579520697
Cross Validated scores
Sensitivity weighted (test): 0.7805
Sensitivity (test): 0.4971
AUC (train): 1.00
AUC (test): 0.8881
CPU times: user 1min 46s, sys: 2.65 s, total: 1min 49s
Wall time: 58min 19s
for name in names:
    print('\n\n============================================================================')
    print(f'================================== {name} ==================================')
    print('============================================================================')
    # Plot confusion matrix
    plot_confusion_matrix(Y, predictions[name], target_names=list(cancers[:9]),
                          title='Confusion Matrix')
    # Plot sensitivities
    plot_sensitivities(predictions[name], Y, title='Sensitivity per Cancer Type')
    # Print cross-validation scores
    display(pd.DataFrame(crossVal_scores[name]))
    # Print classification report
    print(classification_report(Y, predictions[name], target_names=cancers[:9]))
    # Print the fraction of cancer/healthy samples classified
    if name == 'CatBoost':
        # CatBoost returns predictions as a 2-D array; flatten it to 1-D first
        predsCat = np.array([i[0] for i in predictions[name]])
        cSamples = sum((predsCat != 9) & (Y != 9))
        ccSamples = sum(predsCat != 9)
    else:
        cSamples = sum((predictions[name] != 9) & (Y != 9))
        ccSamples = sum(predictions[name] != 9)
    tot = sum(Y != 9)
    print(f'Cancer samples correctly classified (sensitivity): {cSamples}, out of {tot} ({cSamples/tot:.1%})')
    print(f'Total precision on cancer samples: {cSamples/ccSamples:.1%}')
    # Print performance summary
    performance = cv_score_summary(crossVal_scores[name])
    display(performance)
    # Report AUC with a one-standard-deviation band
    med = performance.loc['AUC (mean)', 'Scores']
    std = performance.loc['AUC (mean)', 'Std']
    print("{:.1%} <= AUC <= {:.1%}".format(med - std, med + std))
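The cancer-vs-healthy collapse in the loop above reduces the nine-class problem to a binary one: label 9 is 'Healthy', everything else counts as cancer. A tiny worked sketch with illustrative arrays (not the notebook's actual predictions):

```python
# Binarize multi-class labels into cancer (label != 9) vs healthy (label == 9)
# and compute overall sensitivity and precision on the cancer class.
import numpy as np

Y = np.array([1, 2, 9, 9, 3, 9])       # true labels (toy example)
preds = np.array([1, 9, 9, 2, 3, 9])   # out-of-fold predictions (toy example)

true_cancer = Y != 9
pred_cancer = preds != 9

correct = np.sum(pred_cancer & true_cancer)   # cancers flagged as cancer -> 2
sensitivity = correct / np.sum(true_cancer)   # detected fraction -> 2/3
precision = correct / np.sum(pred_cancer)     # flags that are cancer -> 2/3
```

Note that misclassifying one cancer type as another still counts as "cancer detected" here, which is why these figures are much higher than the per-type sensitivities in the classification reports.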
============================================================================
================================== LogisticRegression ==================================
============================================================================
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.726677 | 0.414256 | 0.975309 | 0.975376 | 0.302636 | 0.383455 | 0.658824 | 0.685246 | 0.658824 | 0.685246 | 0.735853 | 0.840075 | 0.812067 | 0.883874 |
| 1 | 1.616725 | 0.216913 | 0.962963 | 0.974008 | 0.357481 | 0.367828 | 0.658824 | 0.681311 | 0.658824 | 0.681311 | 0.835631 | 0.827721 | 0.877074 | 0.876173 |
| 2 | 1.636172 | 0.425402 | 0.975309 | 0.975376 | 0.411122 | 0.367663 | 0.705882 | 0.676721 | 0.705882 | 0.676721 | 0.821657 | 0.829831 | 0.870607 | 0.877135 |
| 3 | 1.611122 | 0.434955 | 0.962963 | 0.972640 | 0.364909 | 0.375200 | 0.652941 | 0.684590 | 0.652941 | 0.684590 | 0.818883 | 0.832784 | 0.861981 | 0.880634 |
| 4 | 1.603216 | 0.193214 | 1.000000 | 0.972640 | 0.426618 | 0.367104 | 0.711765 | 0.677377 | 0.711765 | 0.677377 | 0.801469 | 0.835717 | 0.855414 | 0.878329 |
| 5 | 1.349270 | 0.187508 | 0.975309 | 0.974008 | 0.377633 | 0.365215 | 0.680473 | 0.676278 | 0.680473 | 0.676278 | 0.781998 | 0.836051 | 0.836276 | 0.880720 |
| 6 | 1.381857 | 0.183913 | 0.939024 | 0.978082 | 0.285164 | 0.392429 | 0.627219 | 0.692005 | 0.627219 | 0.692005 | 0.750243 | 0.835153 | 0.818001 | 0.881361 |
| 7 | 1.308004 | 0.197240 | 0.975610 | 0.973973 | 0.429207 | 0.362564 | 0.710059 | 0.678899 | 0.710059 | 0.678899 | 0.839641 | 0.826542 | 0.888776 | 0.874587 |
| 8 | 0.431425 | 0.175325 | 0.938272 | 0.976744 | 0.334899 | 0.392373 | 0.656805 | 0.688729 | 0.656805 | 0.688729 | 0.747491 | 0.840343 | 0.815929 | 0.881935 |
| 9 | 0.378342 | 0.136061 | 0.987654 | 0.974008 | 0.283922 | 0.372673 | 0.627219 | 0.681520 | 0.627219 | 0.681520 | 0.801963 | 0.833317 | 0.858249 | 0.879394 |
precision recall f1-score support
Colorectum 0.47 0.72 0.57 346
Lung 0.20 0.01 0.02 94
Breast 0.26 0.14 0.18 174
Pancreas 0.67 0.32 0.43 82
Ovary 0.72 0.54 0.62 48
Esophagus 0.36 0.10 0.15 41
Liver 0.59 0.34 0.43 38
Stomach 0.15 0.05 0.07 60
Healthy 0.84 0.97 0.90 812
accuracy 0.67 1695
macro avg 0.47 0.35 0.38 1695
weighted avg 0.62 0.67 0.62 1695
Cancer samples correctly classified (sensitivity): 735, out of 883 (83.2%)
Total precision on cancer samples: 96.7%
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9753 | 0.0184 |
| Sensitivity (med) | 0.3612 | 0.0524 |
| Sensitivity weighted (med) | 0.6588 | 0.0524 |
| AUC (med) | 0.8017 | 0.0359 |
| Specificity (mean) | 0.9692 | 0.0184 |
| Sensitivity (mean) | 0.3574 | 0.0524 |
| Sensitivity weighted (mean) | 0.6690 | 0.0524 |
| AUC (mean) | 0.7935 | 0.0359 |
75.8% <= AUC <= 82.9%

============================================================================
================================== KNN ==================================
============================================================================
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.017157 | 0.268757 | 0.987654 | 0.990424 | 0.240207 | 0.383435 | 0.652941 | 0.704918 | 0.652941 | 0.704918 | 0.640578 | 0.911972 | 0.725431 | 0.933243 |
| 1 | 0.041748 | 0.281546 | 0.962963 | 0.989056 | 0.341343 | 0.374305 | 0.647059 | 0.704262 | 0.647059 | 0.704262 | 0.767152 | 0.908244 | 0.811420 | 0.930262 |
| 2 | 0.039525 | 0.303877 | 0.987654 | 0.986320 | 0.280164 | 0.391695 | 0.629412 | 0.708197 | 0.629412 | 0.708197 | 0.682557 | 0.913319 | 0.755669 | 0.934746 |
| 3 | 0.028036 | 0.289039 | 0.962963 | 0.989056 | 0.327872 | 0.369413 | 0.647059 | 0.700984 | 0.647059 | 0.700984 | 0.739853 | 0.911719 | 0.800021 | 0.932909 |
| 4 | 0.030954 | 0.287276 | 1.000000 | 0.986320 | 0.326984 | 0.368343 | 0.670588 | 0.695738 | 0.670588 | 0.695738 | 0.775782 | 0.906782 | 0.824650 | 0.929575 |
| 5 | 0.031721 | 0.284987 | 0.962963 | 0.987688 | 0.312043 | 0.372435 | 0.627219 | 0.703145 | 0.627219 | 0.703145 | 0.693156 | 0.909997 | 0.757648 | 0.931818 |
| 6 | 0.031226 | 0.312858 | 0.975610 | 0.989041 | 0.363920 | 0.400916 | 0.674556 | 0.711009 | 0.674556 | 0.711009 | 0.738752 | 0.912874 | 0.794362 | 0.933583 |
| 7 | 0.032559 | 0.301574 | 0.987805 | 0.990411 | 0.347065 | 0.351535 | 0.674556 | 0.694626 | 0.674556 | 0.694626 | 0.724642 | 0.907590 | 0.791590 | 0.928981 |
| 8 | 0.050749 | 0.232800 | 0.987654 | 0.990424 | 0.299100 | 0.368662 | 0.650888 | 0.702490 | 0.650888 | 0.702490 | 0.753971 | 0.906287 | 0.809842 | 0.929947 |
| 9 | 0.048501 | 0.225158 | 0.987654 | 0.989056 | 0.210774 | 0.383440 | 0.621302 | 0.701835 | 0.621302 | 0.701835 | 0.645107 | 0.912608 | 0.734502 | 0.931815 |
precision recall f1-score support
Colorectum 0.46 0.67 0.54 346
Lung 0.28 0.12 0.17 94
Breast 0.28 0.09 0.14 174
Pancreas 0.50 0.20 0.28 82
Ovary 0.88 0.29 0.44 48
Esophagus 0.09 0.02 0.04 41
Liver 0.71 0.26 0.38 38
Stomach 0.32 0.10 0.15 60
Healthy 0.79 0.98 0.88 812
accuracy 0.65 1695
macro avg 0.48 0.30 0.34 1695
weighted avg 0.60 0.65 0.60 1695
Cancer samples correctly classified (sensitivity): 674, out of 883 (76.3%)
Total precision on cancer samples: 97.7%
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9877 | 0.0126 |
| Sensitivity (med) | 0.3195 | 0.0462 |
| Sensitivity weighted (med) | 0.6490 | 0.0462 |
| AUC (med) | 0.7317 | 0.0460 |
| Specificity (mean) | 0.9803 | 0.0126 |
| Sensitivity (mean) | 0.3049 | 0.0462 |
| Sensitivity weighted (mean) | 0.6496 | 0.0462 |
| AUC (mean) | 0.7162 | 0.0460 |
67.0% <= AUC <= 76.2%

============================================================================
================================== SVC ==================================
============================================================================
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.271193 | 0.165777 | 0.975309 | 0.980848 | 0.308985 | 0.387043 | 0.670588 | 0.700328 | 0.670588 | 0.700328 | 0.719814 | 0.800253 | 0.817728 | 0.867242 |
| 1 | 13.687822 | 0.194352 | 0.950617 | 0.980848 | 0.359460 | 0.380049 | 0.664706 | 0.696393 | 0.664706 | 0.696393 | 0.783924 | 0.788615 | 0.852976 | 0.860494 |
| 2 | 11.292676 | 0.147028 | 0.975309 | 0.980848 | 0.408134 | 0.369945 | 0.694118 | 0.692459 | 0.694118 | 0.692459 | 0.763535 | 0.787219 | 0.846949 | 0.858865 |
| 3 | 12.677872 | 0.206015 | 0.962963 | 0.982216 | 0.363905 | 0.383676 | 0.670588 | 0.699672 | 0.670588 | 0.699672 | 0.764920 | 0.800707 | 0.846034 | 0.868131 |
| 4 | 13.111074 | 0.169663 | 1.000000 | 0.978112 | 0.413866 | 0.373277 | 0.717647 | 0.691803 | 0.717647 | 0.691803 | 0.766038 | 0.792127 | 0.845176 | 0.859363 |
| 5 | 11.319724 | 0.185836 | 0.975309 | 0.980848 | 0.383531 | 0.373041 | 0.692308 | 0.692005 | 0.692308 | 0.692005 | 0.759615 | 0.796054 | 0.844238 | 0.862657 |
| 6 | 11.415057 | 0.156164 | 0.939024 | 0.982192 | 0.275723 | 0.380277 | 0.633136 | 0.699869 | 0.633136 | 0.699869 | 0.730801 | 0.794580 | 0.819408 | 0.861953 |
| 7 | 11.493093 | 0.117149 | 0.975610 | 0.979452 | 0.376375 | 0.370307 | 0.704142 | 0.697903 | 0.704142 | 0.697903 | 0.804029 | 0.784533 | 0.875353 | 0.857773 |
| 8 | 8.366803 | 0.096110 | 0.962963 | 0.979480 | 0.327838 | 0.383234 | 0.650888 | 0.695937 | 0.650888 | 0.695937 | 0.725142 | 0.809979 | 0.827201 | 0.869300 |
| 9 | 6.980436 | 0.095542 | 1.000000 | 0.979480 | 0.316885 | 0.375596 | 0.680473 | 0.690039 | 0.680473 | 0.690039 | 0.775761 | 0.785935 | 0.848404 | 0.856837 |
precision recall f1-score support
Colorectum 0.47 0.77 0.58 346
Lung 0.00 0.00 0.00 94
Breast 0.24 0.12 0.16 174
Pancreas 0.71 0.41 0.52 82
Ovary 0.76 0.52 0.62 48
Esophagus 0.00 0.00 0.00 41
Liver 0.78 0.37 0.50 38
Stomach 0.00 0.00 0.00 60
Healthy 0.84 0.97 0.90 812
accuracy 0.68 1695
macro avg 0.42 0.35 0.36 1695
weighted avg 0.60 0.68 0.62 1695
Cancer samples correctly classified (sensitivity): 733, out of 883 (83.0%)
Total precision on cancer samples: 97.0%
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9753 | 0.0182 |
| Sensitivity (med) | 0.3617 | 0.0427 |
| Sensitivity weighted (med) | 0.6755 | 0.0427 |
| AUC (med) | 0.7642 | 0.0255 |
| Specificity (mean) | 0.9717 | 0.0182 |
| Sensitivity (mean) | 0.3535 | 0.0427 |
| Sensitivity weighted (mean) | 0.6779 | 0.0427 |
| AUC (mean) | 0.7594 | 0.0255 |
73.4% <= AUC <= 78.5%

============================================================================
================================== RandomForest ==================================
============================================================================
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.465852 | 0.478438 | 0.987654 | 0.990424 | 0.349290 | 0.496787 | 0.717647 | 0.791475 | 0.717647 | 0.791475 | 0.829597 | 0.962042 | 0.902089 | 0.975112 |
| 1 | 3.561345 | 0.481226 | 0.975309 | 0.993160 | 0.425916 | 0.507788 | 0.735294 | 0.800000 | 0.735294 | 0.800000 | 0.868127 | 0.963594 | 0.912853 | 0.976412 |
| 2 | 3.436283 | 0.455013 | 0.975309 | 0.990424 | 0.324007 | 0.486662 | 0.700000 | 0.791475 | 0.700000 | 0.791475 | 0.839222 | 0.963698 | 0.898908 | 0.976786 |
| 3 | 3.507925 | 0.471049 | 0.975309 | 0.993160 | 0.403559 | 0.489401 | 0.705882 | 0.796721 | 0.705882 | 0.796721 | 0.809805 | 0.962493 | 0.880390 | 0.975647 |
| 4 | 3.178692 | 0.421304 | 1.000000 | 0.993160 | 0.436461 | 0.497142 | 0.747059 | 0.798033 | 0.747059 | 0.798033 | 0.835632 | 0.963411 | 0.900297 | 0.976583 |
| 5 | 3.327915 | 0.469575 | 0.987654 | 0.995896 | 0.391880 | 0.473010 | 0.733728 | 0.787680 | 0.733728 | 0.787680 | 0.864133 | 0.961937 | 0.914403 | 0.975602 |
| 6 | 3.202760 | 0.411914 | 0.975610 | 0.993151 | 0.385525 | 0.486886 | 0.745562 | 0.792923 | 0.745562 | 0.792923 | 0.845101 | 0.957251 | 0.909847 | 0.972996 |
| 7 | 3.179191 | 0.417778 | 1.000000 | 0.994521 | 0.413399 | 0.499518 | 0.727811 | 0.795544 | 0.727811 | 0.795544 | 0.866380 | 0.964448 | 0.920220 | 0.977238 |
| 8 | 2.453398 | 0.226891 | 0.962963 | 0.994528 | 0.402003 | 0.466002 | 0.739645 | 0.785059 | 0.739645 | 0.785059 | 0.852585 | 0.959901 | 0.915935 | 0.973979 |
| 9 | 2.485579 | 0.225725 | 0.987654 | 0.990424 | 0.288389 | 0.486365 | 0.704142 | 0.792267 | 0.704142 | 0.792267 | 0.831889 | 0.961266 | 0.899945 | 0.973756 |
precision recall f1-score support
Colorectum 0.48 0.89 0.62 346
Lung 0.00 0.00 0.00 94
Breast 0.47 0.23 0.31 174
Pancreas 0.70 0.59 0.64 82
Ovary 0.79 0.46 0.58 48
Esophagus 0.00 0.00 0.00 41
Liver 0.85 0.29 0.43 38
Stomach 0.00 0.00 0.00 60
Healthy 0.93 0.98 0.96 812
accuracy 0.72 1695
macro avg 0.47 0.38 0.39 1695
weighted avg 0.67 0.72 0.67 1695
Cancer samples correctly classified (sensitivity): 824, out of 883 (93.3%)
Total precision on cancer samples: 98.1%
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9816 | 0.0113 |
| Sensitivity (med) | 0.3969 | 0.0448 |
| Sensitivity weighted (med) | 0.7308 | 0.0448 |
| AUC (med) | 0.8422 | 0.0178 |
| Specificity (mean) | 0.9827 | 0.0113 |
| Sensitivity (mean) | 0.3820 | 0.0448 |
| Sensitivity weighted (mean) | 0.7257 | 0.0448 |
| AUC (mean) | 0.8442 | 0.0178 |
82.6% <= AUC <= 86.2%

============================================================================
================================== GradientBoosting ==================================
============================================================================
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36.271850 | 0.205969 | 1.000000 | 1.0 | 0.452778 | 1.0 | 0.776471 | 1.0 | 0.776471 | 1.0 | 0.863695 | 1.0 | 0.927096 | 1.0 |
| 1 | 35.928520 | 0.252532 | 0.975309 | 1.0 | 0.530502 | 1.0 | 0.782353 | 1.0 | 0.782353 | 1.0 | 0.878202 | 1.0 | 0.923788 | 1.0 |
| 2 | 35.659248 | 0.212248 | 1.000000 | 1.0 | 0.464379 | 1.0 | 0.758824 | 1.0 | 0.758824 | 1.0 | 0.863632 | 1.0 | 0.920246 | 1.0 |
| 3 | 36.085382 | 0.207619 | 0.975309 | 1.0 | 0.463480 | 1.0 | 0.717647 | 1.0 | 0.717647 | 1.0 | 0.853949 | 1.0 | 0.913588 | 1.0 |
| 4 | 35.941035 | 0.381690 | 0.987654 | 1.0 | 0.510284 | 1.0 | 0.770588 | 1.0 | 0.770588 | 1.0 | 0.869421 | 1.0 | 0.921828 | 1.0 |
| 5 | 36.229255 | 0.214372 | 0.987654 | 1.0 | 0.448602 | 1.0 | 0.763314 | 1.0 | 0.763314 | 1.0 | 0.851741 | 1.0 | 0.918684 | 1.0 |
| 6 | 36.079071 | 0.217322 | 0.963415 | 1.0 | 0.518302 | 1.0 | 0.769231 | 1.0 | 0.769231 | 1.0 | 0.888599 | 1.0 | 0.936424 | 1.0 |
| 7 | 36.376711 | 0.240045 | 1.000000 | 1.0 | 0.542665 | 1.0 | 0.792899 | 1.0 | 0.792899 | 1.0 | 0.869589 | 1.0 | 0.922777 | 1.0 |
| 8 | 22.563728 | 0.117003 | 0.975309 | 1.0 | 0.516665 | 1.0 | 0.786982 | 1.0 | 0.786982 | 1.0 | 0.881732 | 1.0 | 0.934835 | 1.0 |
| 9 | 22.274240 | 0.117351 | 0.987654 | 1.0 | 0.453385 | 1.0 | 0.798817 | 1.0 | 0.798817 | 1.0 | 0.857179 | 1.0 | 0.925505 | 1.0 |
precision recall f1-score support
Colorectum 0.63 0.79 0.70 346
Lung 0.51 0.40 0.45 94
Breast 0.53 0.55 0.54 174
Pancreas 0.67 0.59 0.62 82
Ovary 0.70 0.65 0.67 48
Esophagus 0.07 0.02 0.04 41
Liver 0.50 0.32 0.39 38
Stomach 0.21 0.12 0.15 60
Healthy 0.98 0.98 0.98 812
accuracy 0.77 1695
macro avg 0.53 0.49 0.50 1695
weighted avg 0.75 0.77 0.76 1695
Cancer samples correctly classified (sensitivity): 865, out of 883 (98.0%)
Total precision on cancer samples: 98.5%
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9877 | 0.0120 |
| Sensitivity (med) | 0.4873 | 0.0348 |
| Sensitivity weighted (med) | 0.7735 | 0.0348 |
| AUC (med) | 0.8666 | 0.0116 |
| Specificity (mean) | 0.9852 | 0.0120 |
| Sensitivity (mean) | 0.4901 | 0.0348 |
| Sensitivity weighted (mean) | 0.7717 | 0.0348 |
| AUC (mean) | 0.8678 | 0.0116 |
85.6% <= AUC <= 87.9%

============================================================================
================================== CatBoost ==================================
============================================================================
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.685266 | 0.230657 | 1.000000 | 0.994528 | 0.468827 | 0.870578 | 0.776471 | 0.942295 | 0.776471 | 0.942295 | 0.868830 | 0.993598 | 0.927381 | 0.995802 |
| 1 | 6.793162 | 0.205803 | 0.975309 | 0.997264 | 0.513262 | 0.854901 | 0.735294 | 0.937705 | 0.735294 | 0.937705 | 0.903190 | 0.993823 | 0.934909 | 0.996277 |
| 2 | 6.833882 | 0.228156 | 1.000000 | 0.995896 | 0.538741 | 0.859392 | 0.782353 | 0.940984 | 0.782353 | 0.940984 | 0.855502 | 0.994074 | 0.914524 | 0.996330 |
| 3 | 6.755258 | 0.231279 | 0.975309 | 0.997264 | 0.555520 | 0.876435 | 0.764706 | 0.943607 | 0.764706 | 0.943607 | 0.870964 | 0.994930 | 0.923938 | 0.996578 |
| 4 | 5.672084 | 0.265865 | 1.000000 | 0.995896 | 0.545464 | 0.881863 | 0.782353 | 0.946885 | 0.782353 | 0.946885 | 0.882389 | 0.993918 | 0.926260 | 0.996252 |
| 5 | 5.744078 | 0.187460 | 0.987654 | 0.997264 | 0.496476 | 0.873081 | 0.769231 | 0.942333 | 0.769231 | 0.942333 | 0.874682 | 0.993820 | 0.933655 | 0.996140 |
| 6 | 5.690531 | 0.200717 | 0.975610 | 0.994521 | 0.554588 | 0.841723 | 0.798817 | 0.931848 | 0.798817 | 0.931848 | 0.901192 | 0.993390 | 0.940711 | 0.995752 |
| 7 | 5.762475 | 0.220067 | 1.000000 | 0.994521 | 0.555192 | 0.879198 | 0.816568 | 0.944954 | 0.816568 | 0.944954 | 0.898274 | 0.994264 | 0.942662 | 0.996357 |
| 8 | 3.278445 | 0.099915 | 0.975309 | 0.994528 | 0.537525 | 0.860347 | 0.792899 | 0.932503 | 0.792899 | 0.932503 | 0.889384 | 0.993739 | 0.939651 | 0.995906 |
| 9 | 3.303016 | 0.102580 | 0.987654 | 0.994528 | 0.508269 | 0.854740 | 0.786982 | 0.934469 | 0.786982 | 0.934469 | 0.890780 | 0.993808 | 0.941164 | 0.995994 |
precision recall f1-score support
Colorectum 0.64 0.82 0.72 346
Lung 0.58 0.44 0.50 94
Breast 0.53 0.45 0.49 174
Pancreas 0.70 0.66 0.68 82
Ovary 0.84 0.79 0.82 48
Esophagus 0.17 0.10 0.13 41
Liver 0.48 0.37 0.42 38
Stomach 0.33 0.12 0.17 60
Healthy 0.96 0.99 0.97 812
accuracy 0.78 1695
macro avg 0.58 0.53 0.54 1695
weighted avg 0.76 0.78 0.77 1695
Cancer samples correctly classified (sensitivity): 850, out of 883 (96.3%)
Total precision on cancer samples: 98.8%
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9877 | 0.011 |
| Sensitivity (med) | 0.5381 | 0.028 |
| Sensitivity weighted (med) | 0.7824 | 0.028 |
| AUC (med) | 0.8859 | 0.015 |
| Specificity (mean) | 0.9877 | 0.011 |
| Sensitivity (mean) | 0.5274 | 0.028 |
| Sensitivity weighted (mean) | 0.7806 | 0.028 |
| AUC (mean) | 0.8835 | 0.015 |
86.8% <= AUC <= 89.8%

============================================================================
================================== LightGBM ==================================
============================================================================
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.941664 | 0.900762 | 0.987654 | 1.0 | 0.485533 | 1.0 | 0.788235 | 1.0 | 0.788235 | 1.0 | 0.865673 | 1.0 | 0.925616 | 1.0 |
| 1 | 2.961551 | 0.941683 | 0.962963 | 1.0 | 0.562992 | 1.0 | 0.770588 | 1.0 | 0.770588 | 1.0 | 0.901058 | 1.0 | 0.932301 | 1.0 |
| 2 | 2.970062 | 0.929272 | 1.000000 | 1.0 | 0.410279 | 1.0 | 0.735294 | 1.0 | 0.735294 | 1.0 | 0.851729 | 1.0 | 0.912290 | 1.0 |
| 3 | 3.158547 | 0.851240 | 0.987654 | 1.0 | 0.471177 | 1.0 | 0.752941 | 1.0 | 0.752941 | 1.0 | 0.849463 | 1.0 | 0.914262 | 1.0 |
| 4 | 2.567770 | 0.362596 | 0.987654 | 1.0 | 0.458269 | 1.0 | 0.741176 | 1.0 | 0.741176 | 1.0 | 0.875073 | 1.0 | 0.920434 | 1.0 |
| 5 | 2.622387 | 0.414626 | 0.987654 | 1.0 | 0.489431 | 1.0 | 0.781065 | 1.0 | 0.781065 | 1.0 | 0.867625 | 1.0 | 0.928658 | 1.0 |
| 6 | 2.660537 | 0.421098 | 0.975610 | 1.0 | 0.537631 | 1.0 | 0.775148 | 1.0 | 0.775148 | 1.0 | 0.895997 | 1.0 | 0.936130 | 1.0 |
| 7 | 2.670143 | 0.479897 | 1.000000 | 1.0 | 0.494208 | 1.0 | 0.781065 | 1.0 | 0.781065 | 1.0 | 0.868566 | 1.0 | 0.921236 | 1.0 |
| 8 | 2.029359 | 0.196047 | 0.975309 | 1.0 | 0.524816 | 1.0 | 0.792899 | 1.0 | 0.792899 | 1.0 | 0.866845 | 1.0 | 0.929546 | 1.0 |
| 9 | 1.944411 | 0.248451 | 0.975309 | 1.0 | 0.462453 | 1.0 | 0.775148 | 1.0 | 0.775148 | 1.0 | 0.864871 | 1.0 | 0.929427 | 1.0 |
precision recall f1-score support
Colorectum 0.62 0.81 0.70 346
Lung 0.52 0.36 0.43 94
Breast 0.52 0.52 0.52 174
Pancreas 0.71 0.59 0.64 82
Ovary 0.80 0.67 0.73 48
Esophagus 0.14 0.05 0.07 41
Liver 0.46 0.29 0.35 38
Stomach 0.24 0.13 0.17 60
Healthy 0.97 0.98 0.98 812
accuracy 0.77 1695
macro avg 0.55 0.49 0.51 1695
weighted avg 0.75 0.77 0.76 1695
Cancer samples correctly classified (sensitivity): 861, out of 883 (97.5%)
Total precision on cancer samples: 98.5%
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9877 | 0.0111 |
| Sensitivity (med) | 0.4875 | 0.0416 |
| Sensitivity weighted (med) | 0.7751 | 0.0416 |
| AUC (med) | 0.8672 | 0.0157 |
| Specificity (mean) | 0.9840 | 0.0111 |
| Sensitivity (mean) | 0.4897 | 0.0416 |
| Sensitivity weighted (mean) | 0.7694 | 0.0416 |
| AUC (mean) | 0.8707 | 0.0157 |
85.5% <= AUC <= 88.6%

============================================================================
================================== XGBoost ==================================
============================================================================
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.826599 | 0.256482 | 1.000000 | 1.0 | 0.486905 | 1.000000 | 0.794118 | 1.000000 | 0.794118 | 1.000000 | 0.875890 | 1.0 | 0.931837 | 1.0 |
| 1 | 2.803414 | 0.248255 | 0.975309 | 1.0 | 0.535396 | 0.999643 | 0.800000 | 0.999344 | 0.800000 | 0.999344 | 0.903352 | 1.0 | 0.936920 | 1.0 |
| 2 | 2.799333 | 0.295478 | 1.000000 | 1.0 | 0.496335 | 1.000000 | 0.758824 | 1.000000 | 0.758824 | 1.000000 | 0.848834 | 1.0 | 0.914702 | 1.0 |
| 3 | 2.814667 | 0.365639 | 0.962963 | 1.0 | 0.495208 | 1.000000 | 0.758824 | 1.000000 | 0.758824 | 1.000000 | 0.881525 | 1.0 | 0.933310 | 1.0 |
| 4 | 2.851409 | 0.214412 | 0.987654 | 1.0 | 0.493267 | 1.000000 | 0.758824 | 1.000000 | 0.758824 | 1.000000 | 0.884278 | 1.0 | 0.929192 | 1.0 |
| 5 | 2.808505 | 0.241189 | 0.987654 | 1.0 | 0.481495 | 0.999643 | 0.781065 | 0.999345 | 0.781065 | 0.999345 | 0.885955 | 1.0 | 0.938447 | 1.0 |
| 6 | 2.849863 | 0.259715 | 0.975610 | 1.0 | 0.504044 | 1.000000 | 0.769231 | 1.000000 | 0.769231 | 1.000000 | 0.899207 | 1.0 | 0.943568 | 1.0 |
| 7 | 2.866695 | 0.289637 | 1.000000 | 1.0 | 0.513072 | 1.000000 | 0.804734 | 1.000000 | 0.804734 | 1.000000 | 0.904886 | 1.0 | 0.941256 | 1.0 |
| 8 | 1.839791 | 0.127954 | 1.000000 | 1.0 | 0.537364 | 1.000000 | 0.804734 | 1.000000 | 0.804734 | 1.000000 | 0.903801 | 1.0 | 0.948841 | 1.0 |
| 9 | 1.714739 | 0.159592 | 0.987654 | 1.0 | 0.428331 | 1.000000 | 0.775148 | 1.000000 | 0.775148 | 1.000000 | 0.893021 | 1.0 | 0.944882 | 1.0 |
precision recall f1-score support
Colorectum 0.61 0.85 0.71 346
Lung 0.52 0.34 0.41 94
Breast 0.53 0.53 0.53 174
Pancreas 0.74 0.59 0.65 82
Ovary 0.81 0.71 0.76 48
Esophagus 0.22 0.05 0.08 41
Liver 0.55 0.32 0.40 38
Stomach 0.30 0.10 0.15 60
Healthy 0.98 0.99 0.98 812
accuracy 0.78 1695
macro avg 0.58 0.50 0.52 1695
weighted avg 0.76 0.78 0.76 1695
Cancer samples correctly classified (sensitivity): 866, out of 883 (98.1%)
Total precision on cancer samples: 98.9%
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9877 | 0.0123 |
| Sensitivity (med) | 0.4958 | 0.0291 |
| Sensitivity weighted (med) | 0.7781 | 0.0291 |
| AUC (med) | 0.8895 | 0.0163 |
| Specificity (mean) | 0.9877 | 0.0123 |
| Sensitivity (mean) | 0.4971 | 0.0291 |
| Sensitivity weighted (mean) | 0.7805 | 0.0291 |
| AUC (mean) | 0.8881 | 0.0163 |
87.2% <= AUC <= 90.4%
Plot SHAP values for the classifiers supported by shap's TreeExplainer.
# Select predictors and target variable
X = sh_data[numerical_features]
Y = sh_data['Tumor type']
# Split into train and test sets
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2, random_state=89, stratify=Y)
# Transform the train and test set in the same way as in the pipeline
pt = PercentileTransformer()
sc = StandardScaler()
pt.fit(trainX, trainY)
sc.fit(trainX)
testX = pt.transform(testX)
testX = sc.transform(testX)
# Create dataframe so feature names are shown
testX = pd.DataFrame(testX, columns=numerical_features)
for name in names[5:]:
print('\n\n============================================================================')
print(f'================================== {name} ==================================')
print('============================================================================')
# For multiclass models, TreeExplainer.shap_values returns a list with
# one array per class; [1] hard-codes the SHAP values of class index 1
shap_values = shap.TreeExplainer(best_models[name],
                                 feature_perturbation="tree_path_dependent").shap_values(testX)[1]
shap.summary_plot(shap_values, testX)
============================================================================
================================== CatBoost ==================================
============================================================================
Setting feature_perturbation = "tree_path_dependent" because no background data was given.
============================================================================
================================== LightGBM ==================================
============================================================================
============================================================================
================================== XGBoost ==================================
============================================================================
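As a side note, the per-class SHAP arrays plotted above can also be collapsed into a single global feature ranking. This is a hedged sketch on synthetic arrays (the shapes and the feature-name subset are made up for illustration), not a call into the fitted models:

```python
import numpy as np

# Synthetic stand-ins for TreeExplainer output on a multiclass model:
# one (n_samples, n_features) array of SHAP values per class
rng = np.random.default_rng(0)
feature_names = ['Aneuploidy', 'CA19-9', 'OPN']  # illustrative subset
shap_vals = [rng.normal(size=(50, 3)) for _ in range(9)]

# Global importance: mean absolute SHAP value per feature,
# averaged over all classes
importance = np.mean([np.abs(sv).mean(axis=0) for sv in shap_vals], axis=0)
ranking = [feature_names[i] for i in np.argsort(importance)[::-1]]
print(ranking)
```

This is the same "mean |SHAP|" aggregation that `shap.summary_plot(..., plot_type='bar')` performs internally for a single class, extended by averaging over classes.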
print(f'\n================================== {names[3]} ==================================')
# Plot feature importance
display(plot_feature_importance(best_models[names[3]], X.columns, figsize=(8,4)))
================================== RandomForest ==================================
None
print(f'\n================================== {names[4]} ==================================')
# Plot feature importance
display(plot_feature_importance(best_models[names[4]], X.columns, figsize=(8,4)))
================================== GradientBoosting ==================================
None
print(f'\n================================== {names[5]} ==================================')
# Plot feature importance
display(plot_feature_importance(best_models[names[5]], X.columns, figsize=(8,4)))
================================== CatBoost ==================================
None
print(f'\n================================== {names[6]} ==================================')
# Plot feature importance
display(plot_feature_importance(best_models[names[6]], X.columns, figsize=(8,4)))
================================== LightGBM ==================================
None
print(f'\n================================== {names[7]} ==================================')
# Plot feature importance
display(plot_feature_importance(best_models[names[7]], X.columns, figsize=(8,4)))
================================== XGBoost ==================================
None
The highest specificity, about 99%, is obtained by the CatBoost and XGBoost classifiers. Specificity is high across the board, with all other models scoring between 97% and 98%.
Taking sensitivity into account as well, XGBoost is the most performant model, detecting 98% of cancer samples (at 99% specificity).
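The cancer-level sensitivity and precision quoted above are binary summaries computed over the multiclass predictions: a cancer sample counts as detected if it is predicted as any cancer type. A minimal sketch, assuming (as in this notebook's `!= 9` checks) that the Healthy class is encoded as label 9; the toy arrays are hypothetical:

```python
import numpy as np

# Toy labels (hypothetical): 1-8 encode cancer types, 9 encodes Healthy,
# mirroring the notebook's `preds != 9` convention
y_true = np.array([1, 2, 9, 9, 3, 9, 4, 9])
y_pred = np.array([1, 9, 9, 1, 3, 9, 5, 9])

# A cancer sample counts as detected when it is predicted as *any*
# cancer type, not necessarily the correct one
detected = np.sum((y_pred != 9) & (y_true != 9))
flagged = np.sum(y_pred != 9)     # samples predicted as cancer
n_cancer = np.sum(y_true != 9)    # true cancer samples

sensitivity = detected / n_cancer
precision = detected / flagged
print(f"sensitivity={sensitivity:.0%}, precision={precision:.0%}")  # both 75% here
```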
Try a VotingClassifier built from the estimators above. Use voting="soft", which often works better than hard voting for well-calibrated models, and give more weight to the more performant estimators.
%%time
# Select predictors and target variable
X = sh_data[numerical_features]
Y = sh_data['Tumor type']
# Split into train and test sets
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2, random_state=89, stratify=Y)
# Create Voting Classifier from all eight tuned estimators,
# giving higher weights to the stronger models
vtclf = ensemble.VotingClassifier(estimators=list(best_models.items()),
                                  voting='soft', weights=[1, 1, 1, 2, 3, 4, 3, 4], n_jobs=-1)
# Create pipeline
vtclf_pipeline = Pipeline(steps=[('numerical_pipeline', numerical_pipeline),
('vtclf', vtclf)], verbose=3)
vtclf_pipeline.fit(trainX, trainY)
# Cross validate above parameter tuning
cvScores_voting = crossVal(vtclf_pipeline, X, Y, cv_folds=10)
[Pipeline] (step 1 of 2) Processing numerical_pipeline, total= 0.0s
[Pipeline] ............. (step 2 of 2) Processing vtclf, total= 13.9s
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 2.8min remaining: 1.2min
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 3.3min finished
Model report
Cross Validated scores
Sensitivity weighted (test): 0.7806
Sensitivity (test): 0.4948
AUC (train): 1.00
AUC (test): 0.8824
CPU times: user 650 ms, sys: 164 ms, total: 814 ms
Wall time: 3min 35s
pd.DataFrame(cvScores_voting)
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70.541075 | 1.042007 | 1.000000 | 1.0 | 0.430732 | 0.999288 | 0.776471 | 0.999344 | 0.776471 | 0.999344 | 0.852914 | 1.0 | 0.922169 | 1.0 |
| 1 | 69.114949 | 1.054669 | 0.975309 | 1.0 | 0.533676 | 1.000000 | 0.788235 | 1.000000 | 0.788235 | 1.000000 | 0.902202 | 1.0 | 0.933572 | 1.0 |
| 2 | 67.904180 | 1.078009 | 1.000000 | 1.0 | 0.449378 | 1.000000 | 0.752941 | 1.000000 | 0.752941 | 1.000000 | 0.856476 | 1.0 | 0.913739 | 1.0 |
| 3 | 63.935464 | 1.028889 | 0.987654 | 1.0 | 0.533502 | 1.000000 | 0.764706 | 1.000000 | 0.764706 | 1.000000 | 0.868720 | 1.0 | 0.924784 | 1.0 |
| 4 | 70.667051 | 0.926147 | 0.987654 | 1.0 | 0.510097 | 0.999292 | 0.776471 | 0.999344 | 0.776471 | 0.999344 | 0.882290 | 1.0 | 0.929521 | 1.0 |
| 5 | 70.613867 | 0.866615 | 0.987654 | 1.0 | 0.482729 | 1.000000 | 0.781065 | 1.000000 | 0.781065 | 1.000000 | 0.879855 | 1.0 | 0.935037 | 1.0 |
| 6 | 69.940644 | 0.852014 | 0.975610 | 1.0 | 0.514283 | 1.000000 | 0.781065 | 1.000000 | 0.781065 | 1.000000 | 0.891670 | 1.0 | 0.938659 | 1.0 |
| 7 | 68.778893 | 0.919429 | 1.000000 | 1.0 | 0.519063 | 1.000000 | 0.804734 | 1.000000 | 0.804734 | 1.000000 | 0.909538 | 1.0 | 0.943737 | 1.0 |
| 8 | 30.285260 | 0.665442 | 1.000000 | 1.0 | 0.543174 | 1.000000 | 0.798817 | 1.000000 | 0.798817 | 1.000000 | 0.883178 | 1.0 | 0.939845 | 1.0 |
| 9 | 30.360282 | 0.655942 | 0.987654 | 1.0 | 0.431598 | 1.000000 | 0.781065 | 1.000000 | 0.781065 | 1.000000 | 0.897158 | 1.0 | 0.947123 | 1.0 |
# Make predictions with the Voting Classifier on the entire dataset
preds_voting = cross_val_predict(vtclf_pipeline, X, Y, cv=10, verbose=3, n_jobs=-1)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 2.2min remaining: 55.4s
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 2.8min finished
# Plot Confusion Matrix
plot_confusion_matrix(Y, preds_voting, target_names=[i for i in cancers[:9]],
title='Confusion Matrix')
# Plot sensitivities
plot_sensitivities(preds_voting, Y, title='Sensitivity per Cancer Type')
cSamples = sum((preds_voting != 9) & (Y != 9))
ccSamples = sum(preds_voting != 9)
tot = sum(Y != 9)
print(f'Cancer samples correctly classified (sensitivity): {cSamples}, out of {tot} ({cSamples/tot:.1%})')
print(f'Total precision on cancer samples: {cSamples/ccSamples:.1%}')
Cancer samples correctly classified (sensitivity): 862, out of 883 (97.6%)
Total precision on cancer samples: 99.1%
print(classification_report(Y, preds_voting, target_names=cancers[:9]))
precision recall f1-score support
Colorectum 0.60 0.85 0.70 346
Lung 0.54 0.34 0.42 94
Breast 0.56 0.52 0.54 174
Pancreas 0.71 0.59 0.64 82
Ovary 0.79 0.69 0.73 48
Esophagus 0.12 0.02 0.04 41
Liver 0.57 0.34 0.43 38
Stomach 0.40 0.10 0.16 60
Healthy 0.97 0.99 0.98 812
accuracy 0.78 1695
macro avg 0.58 0.49 0.52 1695
weighted avg 0.76 0.78 0.76 1695
# Print performance
performance_voting = cv_score_summary(cvScores_voting)
display(performance_voting)
# Calculate AUC with standard deviations
med = performance_voting.loc['AUC (mean)', 'Scores']
std = performance_voting.loc['AUC (mean)', 'Std']
print("{:.1%} <= AUC <= {:.1%}".format(med-std, med+std))
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9877 | 0.0092 |
| Sensitivity (med) | 0.5122 | 0.0411 |
| Sensitivity weighted (med) | 0.7811 | 0.0411 |
| AUC (med) | 0.8827 | 0.0178 |
| Specificity (mean) | 0.9902 | 0.0092 |
| Sensitivity (mean) | 0.4948 | 0.0411 |
| Sensitivity weighted (mean) | 0.7806 | 0.0411 |
| AUC (mean) | 0.8824 | 0.0178 |
86.5% <= AUC <= 90.0%
A VotingClassifier with higher weights on the most performant individual estimators, CatBoost and XGBoost, improves specificity slightly further, to more than 99%: only 8 of the 812 healthy samples are incorrectly classified as cancer.
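The mechanics behind voting='soft' with weights can be sketched directly in NumPy; the probabilities and weights below are hypothetical, not taken from the fitted models:

```python
import numpy as np

# Predicted class probabilities from three hypothetical base models
# for a single sample over three classes
proba = np.array([
    [0.6, 0.3, 0.1],   # model A
    [0.2, 0.5, 0.3],   # model B
    [0.1, 0.2, 0.7],   # model C
])
weights = np.array([1.0, 2.0, 4.0])  # stronger models count more

# Soft voting averages the class probabilities, weighted per estimator,
# then predicts the argmax of the blend
avg = np.average(proba, axis=0, weights=weights)
pred = avg.argmax()
print(avg, pred)  # the heavily weighted model C pulls the vote to class 2
```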
Run several experiments that stack the previously trained models into a larger and, hopefully, more powerful combined model. Start with Logistic Regression as the final estimator of the StackingClassifier.
# Keep only the better-performing models for stacking; drop the weaker ones
to_stack = copy.deepcopy(best_models)
del to_stack['RandomForest']
del to_stack['LogisticRegression']
del to_stack['SVC']
to_stack
Finished loading model, total used 400 iterations
{'KNN': KNeighborsClassifier(algorithm='auto', leaf_size=2, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=8, p=1,
weights='uniform'),
'GradientBoosting': GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=600,
n_iter_no_change=None, presort='deprecated',
random_state=None, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False),
'CatBoost': <catboost.core.CatBoostClassifier at 0x1c1d461a10>,
'LightGBM': LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=3,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=400, n_jobs=-1, num_leaves=8,
objective='multiclass', random_state=None, reg_alpha=0.0,
reg_lambda=0.0, silent=True, subsample=1.0,
subsample_for_bin=200000, subsample_freq=0),
'XGBoost': XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.5, eval_metric='merror',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints=None, learning_rate=0.1, max_delta_step=0,
max_depth=4, min_child_weight=1, missing=nan,
monotone_constraints=None, n_estimators=200, n_jobs=0,
num_parallel_tree=1, objective='multi:softprob', random_state=29,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=29,
subsample=0.8, tree_method=None, validate_parameters=False,
verbosity=None)}
%%time
# Select predictors and target variable
X = sh_data[numerical_features]
Y = sh_data['Tumor type']
# Split into train and test sets
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2, random_state=89, stratify=Y)
# Create a final meta estimator for the stacking classifier
meta_estimator = LogisticRegression(max_iter=1000)
# Create Stacking Classifier from the five selected estimators
stclf = ensemble.StackingClassifier(estimators=list(to_stack.items()),
                                    final_estimator=meta_estimator, passthrough=True,
                                    cv=10, n_jobs=-1)
# Create pipeline
stclf_pipeline = Pipeline(steps=[('numerical_pipeline', numerical_pipeline),
('stacking_clf', stclf)], verbose=3)
stclf_pipeline.fit(trainX, trainY)
# Cross validate above parameter tuning
cvScores_stacking = crossVal(stclf_pipeline, X, Y, cv_folds=10)
[Pipeline] (step 1 of 2) Processing numerical_pipeline, total= 0.0s
[Pipeline] ...... (step 2 of 2) Processing stacking_clf, total= 2.2min
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 14.0min remaining: 6.0min
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 18.3min finished
Model report
Cross Validated scores
Sensitivity weighted (test): 0.7799
Sensitivity (test): 0.5114
AUC (train): 0.97
AUC (test): 0.8647
CPU times: user 921 ms, sys: 253 ms, total: 1.17 s
Wall time: 20min 31s
pd.DataFrame(cvScores_stacking)
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 417.618603 | 0.706378 | 1.000000 | 1.0 | 0.433907 | 0.739133 | 0.782353 | 0.933115 | 0.782353 | 0.933115 | 0.832784 | 0.952121 | 0.911864 | 0.982419 |
| 1 | 417.143276 | 0.622741 | 0.975309 | 1.0 | 0.535219 | 0.748871 | 0.776471 | 0.937049 | 0.776471 | 0.937049 | 0.888800 | 0.960275 | 0.928746 | 0.984560 |
| 2 | 413.228604 | 0.817719 | 1.000000 | 1.0 | 0.507501 | 0.782590 | 0.770588 | 0.944262 | 0.770588 | 0.944262 | 0.865145 | 0.976807 | 0.920279 | 0.991371 |
| 3 | 417.375315 | 0.696312 | 0.975309 | 1.0 | 0.539460 | 0.743629 | 0.776471 | 0.933115 | 0.776471 | 0.933115 | 0.859254 | 0.964611 | 0.919880 | 0.985070 |
| 4 | 411.839261 | 0.812046 | 0.987654 | 1.0 | 0.535681 | 0.758847 | 0.788235 | 0.935738 | 0.788235 | 0.935738 | 0.855822 | 0.964106 | 0.909451 | 0.987339 |
| 5 | 411.352138 | 0.664669 | 0.987654 | 1.0 | 0.485364 | 0.752345 | 0.763314 | 0.933159 | 0.763314 | 0.933159 | 0.862748 | 0.964824 | 0.910826 | 0.985019 |
| 6 | 405.915809 | 0.590974 | 0.963415 | 1.0 | 0.483371 | 0.746766 | 0.757396 | 0.932503 | 0.757396 | 0.932503 | 0.851048 | 0.955577 | 0.912081 | 0.981504 |
| 7 | 405.377927 | 0.599476 | 1.000000 | 1.0 | 0.592048 | 0.794201 | 0.804734 | 0.948231 | 0.804734 | 0.948231 | 0.896619 | 0.976285 | 0.937502 | 0.989740 |
| 8 | 253.505293 | 0.333992 | 0.975309 | 1.0 | 0.534620 | 0.775548 | 0.798817 | 0.942333 | 0.798817 | 0.942333 | 0.860204 | 0.956781 | 0.919599 | 0.982185 |
| 9 | 251.158791 | 0.352784 | 0.975309 | 1.0 | 0.467082 | 0.789301 | 0.781065 | 0.944954 | 0.781065 | 0.944954 | 0.874767 | 0.978891 | 0.935469 | 0.991958 |
# Make predictions with the Stacking Classifier on the entire dataset
preds_stacking = cross_val_predict(stclf_pipeline, X, Y, cv=10, verbose=3, n_jobs=-1)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 13.8min remaining: 5.9min
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 17.7min finished
# Plot Confusion Matrix
plot_confusion_matrix(Y, preds_stacking, target_names=[i for i in cancers[:9]],
title='Confusion Matrix')
# Plot sensitivities
plot_sensitivities(preds_stacking, Y, title='Sensitivity per Cancer Type')
cSamples = sum((preds_stacking != 9) & (Y != 9))
ccSamples = sum(preds_stacking != 9)
tot = sum(Y != 9)
print(f'Cancer samples correctly classified (sensitivity): {cSamples}, out of {tot} ({cSamples/tot:.1%})')
print(f'Total precision on cancer samples: {cSamples/ccSamples:.1%}')
Cancer samples correctly classified (sensitivity): 865, out of 883 (98.0%)
Total precision on cancer samples: 98.6%
print(classification_report(Y, preds_stacking, target_names=cancers[:9]))
precision recall f1-score support
Colorectum 0.64 0.83 0.72 346
Lung 0.53 0.45 0.48 94
Breast 0.51 0.53 0.52 174
Pancreas 0.70 0.60 0.64 82
Ovary 0.84 0.67 0.74 48
Esophagus 0.33 0.15 0.20 41
Liver 0.58 0.37 0.45 38
Stomach 0.38 0.10 0.16 60
Healthy 0.98 0.99 0.98 812
accuracy 0.78 1695
macro avg 0.61 0.52 0.55 1695
weighted avg 0.77 0.78 0.77 1695
# Print performance
performance_stacking = cv_score_summary(cvScores_stacking)
display(performance_stacking)
# Calculate AUC with standard deviations
med = performance_stacking.loc['AUC (mean)', 'Scores']
std = performance_stacking.loc['AUC (mean)', 'Std']
print("{:.1%} <= AUC <= {:.1%}".format(med-std, med+std))
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9815 | 0.0123 |
| Sensitivity (med) | 0.5211 | 0.0429 |
| Sensitivity weighted (med) | 0.7788 | 0.0429 |
| AUC (med) | 0.8615 | 0.0174 |
| Specificity (mean) | 0.9840 | 0.0123 |
| Sensitivity (mean) | 0.5114 | 0.0429 |
| Sensitivity weighted (mean) | 0.7799 | 0.0429 |
| AUC (mean) | 0.8647 | 0.0174 |
84.7% <= AUC <= 88.2%
Use Random Forest as the final estimator, with n_estimators=300 and max_depth=4, values that performed well in earlier experiments.
%%time
# Select predictors and target variable
X = sh_data[numerical_features]
Y = sh_data['Tumor type']
# Split into train and test sets
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2, random_state=89, stratify=Y)
# Create a final meta estimator for the stacking classifier
meta_estimator = ensemble.RandomForestClassifier(n_estimators=300,
max_depth=4)
# Create Stacking Classifier from the five selected estimators
stclf2 = ensemble.StackingClassifier(estimators=list(to_stack.items()),
                                     final_estimator=meta_estimator, passthrough=True,
                                     cv=10, n_jobs=-1)
# Create pipeline
stclf_pipeline2 = Pipeline(steps=[('numerical_pipeline', numerical_pipeline),
('stacking_clf', stclf2)], verbose=3)
stclf_pipeline2.fit(trainX, trainY)
# Cross validate above parameter tuning
cvScores_stacking2 = crossVal(stclf_pipeline2, X, Y, cv_folds=10)
[Pipeline] (step 1 of 2) Processing numerical_pipeline, total= 0.0s
[Pipeline] ...... (step 2 of 2) Processing stacking_clf, total= 1.8min
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 14.0min remaining: 6.0min
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 18.0min finished
Model report
Cross Validated scores
Sensitivity weighted (test): 0.7687
Sensitivity (test): 0.4283
AUC (train): 0.97
AUC (test): 0.8541
CPU times: user 1.51 s, sys: 162 ms, total: 1.68 s
Wall time: 19min 45s
pd.DataFrame(cvScores_stacking2)
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 412.720224 | 0.769883 | 1.000000 | 1.0 | 0.403131 | 0.628758 | 0.788235 | 0.899016 | 0.788235 | 0.899016 | 0.837873 | 0.971547 | 0.922321 | 0.989915 |
| 1 | 412.573826 | 0.787870 | 0.975309 | 1.0 | 0.481383 | 0.641830 | 0.788235 | 0.905574 | 0.788235 | 0.905574 | 0.861231 | 0.974673 | 0.918589 | 0.990356 |
| 2 | 412.560452 | 0.821467 | 1.000000 | 1.0 | 0.416737 | 0.606668 | 0.758824 | 0.888525 | 0.758824 | 0.888525 | 0.837191 | 0.970307 | 0.908904 | 0.991208 |
| 3 | 412.458891 | 0.810914 | 0.975309 | 1.0 | 0.421580 | 0.556607 | 0.747059 | 0.862951 | 0.747059 | 0.862951 | 0.841790 | 0.969070 | 0.914864 | 0.989623 |
| 4 | 413.165638 | 0.747526 | 0.987654 | 1.0 | 0.440365 | 0.600529 | 0.752941 | 0.885902 | 0.752941 | 0.885902 | 0.848311 | 0.969462 | 0.916576 | 0.990409 |
| 5 | 412.693569 | 0.755581 | 0.987654 | 1.0 | 0.415025 | 0.555198 | 0.769231 | 0.861730 | 0.769231 | 0.861730 | 0.890399 | 0.965339 | 0.941864 | 0.989225 |
| 6 | 412.841477 | 0.732389 | 0.975610 | 1.0 | 0.440500 | 0.572579 | 0.775148 | 0.870904 | 0.775148 | 0.870904 | 0.869440 | 0.970169 | 0.931122 | 0.991088 |
| 7 | 412.321225 | 0.774242 | 1.000000 | 1.0 | 0.480664 | 0.624867 | 0.798817 | 0.897772 | 0.798817 | 0.897772 | 0.841366 | 0.970014 | 0.912082 | 0.989975 |
| 8 | 236.926524 | 0.448539 | 0.950617 | 1.0 | 0.417861 | 0.556863 | 0.763314 | 0.863041 | 0.763314 | 0.863041 | 0.862825 | 0.962687 | 0.935079 | 0.985711 |
| 9 | 235.706539 | 0.406694 | 0.975309 | 1.0 | 0.365593 | 0.559477 | 0.745562 | 0.864351 | 0.745562 | 0.864351 | 0.851060 | 0.970286 | 0.928339 | 0.990192 |
# Make predictions with the Stacking Classifier on the entire dataset
preds_stacking2 = cross_val_predict(stclf_pipeline2, X, Y, cv=10, verbose=3, n_jobs=-1)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 17.7min remaining: 7.6min
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 21.8min finished
# Plot Confusion Matrix
plot_confusion_matrix(Y, preds_stacking2, target_names=[i for i in cancers[:9]],
title='Confusion Matrix')
# Plot sensitivities
plot_sensitivities(preds_stacking2, Y, title='Sensitivity per Cancer Type')
cSamples = sum((preds_stacking2 != 9) & (Y != 9))
ccSamples = sum(preds_stacking2 != 9)
tot = sum(Y != 9)
print(f'Cancer samples correctly classified (sensitivity): {cSamples}, out of {tot} ({cSamples/tot:.1%})')
print(f'Total precision on cancer samples: {cSamples/ccSamples:.1%}')
Cancer samples correctly classified (sensitivity): 865, out of 883 (98.0%)
Total precision on cancer samples: 98.4%
print(classification_report(Y, preds_stacking2, target_names=cancers[:9]))
precision recall f1-score support
Colorectum 0.58 0.89 0.70 346
Lung 0.70 0.15 0.25 94
Breast 0.48 0.63 0.54 174
Pancreas 0.77 0.57 0.66 82
Ovary 0.82 0.67 0.74 48
Esophagus 0.00 0.00 0.00 41
Liver 0.00 0.00 0.00 38
Stomach 0.00 0.00 0.00 60
Healthy 0.98 0.98 0.98 812
accuracy 0.77 1695
macro avg 0.48 0.43 0.43 1695
weighted avg 0.73 0.77 0.73 1695
Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
# Print performance
performance_stacking2 = cv_score_summary(cvScores_stacking2)
display(performance_stacking2)
# Calculate AUC with standard deviations
med = performance_stacking2.loc['AUC (mean)', 'Scores']
std = performance_stacking2.loc['AUC (mean)', 'Std']
print("{:.1%} <= AUC <= {:.1%}".format(med-std, med+std))
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9816 | 0.0148 |
| Sensitivity (med) | 0.4197 | 0.0330 |
| Sensitivity weighted (med) | 0.7663 | 0.0330 |
| AUC (med) | 0.8497 | 0.0161 |
| Specificity (mean) | 0.9827 | 0.0148 |
| Sensitivity (mean) | 0.4283 | 0.0330 |
| Sensitivity weighted (mean) | 0.7687 | 0.0330 |
| AUC (mean) | 0.8541 | 0.0161 |
83.8% <= AUC <= 87.0%
Use XGBoost as the final estimator and train it on both the predictions of the five base estimators and the original features by setting passthrough=True.
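What passthrough=True changes about the final estimator's input can be sketched with shapes alone; all sizes below are illustrative, not taken from the fitted pipeline:

```python
import numpy as np

# Illustrative sizes: 5 base estimators, 9 classes, 10 original features
n_samples, n_features, n_classes, n_estimators = 100, 10, 9, 5

rng = np.random.default_rng(0)
base_proba = [rng.random((n_samples, n_classes)) for _ in range(n_estimators)]
X_orig = rng.random((n_samples, n_features))

# passthrough=False: the meta estimator trains only on the stacked
# out-of-fold class probabilities of the base estimators
meta_X = np.hstack(base_proba)                # shape (100, 45)
# passthrough=True: the original features are appended as well
meta_X_pt = np.hstack(base_proba + [X_orig])  # shape (100, 55)
print(meta_X.shape, meta_X_pt.shape)
```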
%%time
# Select predictors and target variable
X = sh_data[numerical_features]
Y = sh_data['Tumor type']
# Split into train and test sets
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2, random_state=89, stratify=Y)
# Create a final meta estimator for the stacking classifier
meta_estimator = xgb.XGBClassifier(n_estimators=400,
max_depth=4)
# Create Stacking Classifier from the five selected estimators
stclf3 = ensemble.StackingClassifier(estimators=list(to_stack.items()),
                                     final_estimator=meta_estimator, passthrough=True,
                                     cv=10, n_jobs=-1)
# Create pipeline
stclf_pipeline3 = Pipeline(steps=[('numerical_pipeline', numerical_pipeline),
('stacking_clf', stclf3)], verbose=3)
stclf_pipeline3.fit(trainX, trainY)
# Cross validate above parameter tuning
cvScores_stacking3 = crossVal(stclf_pipeline3, X, Y, cv_folds=10)
[Pipeline] (step 1 of 2) Processing numerical_pipeline, total= 0.0s
[Pipeline] ...... (step 2 of 2) Processing stacking_clf, total= 1.9min
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 14.5min remaining: 6.2min
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 18.7min finished
Model report
Cross Validated scores
Sensitivity weighted (test): 0.7776
Sensitivity (test): 0.5308
AUC (train): 0.98
AUC (test): 0.8678
CPU times: user 21.6 s, sys: 555 ms, total: 22.1 s
Wall time: 20min 34s
pd.DataFrame(cvScores_stacking3)
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 427.154751 | 0.712114 | 1.000000 | 1.0 | 0.478086 | 0.728021 | 0.770588 | 0.912787 | 0.770588 | 0.912787 | 0.862939 | 0.964187 | 0.928678 | 0.986024 |
| 1 | 426.651409 | 0.651325 | 0.975309 | 1.0 | 0.550299 | 0.796364 | 0.764706 | 0.926557 | 0.764706 | 0.926557 | 0.881280 | 0.977692 | 0.923269 | 0.990624 |
| 2 | 426.278828 | 0.564325 | 1.000000 | 1.0 | 0.496732 | 0.812794 | 0.770588 | 0.939016 | 0.770588 | 0.939016 | 0.835271 | 0.985246 | 0.906668 | 0.993913 |
| 3 | 426.779901 | 0.739777 | 0.962963 | 1.0 | 0.570791 | 0.802334 | 0.782353 | 0.933770 | 0.782353 | 0.933770 | 0.860921 | 0.989337 | 0.921728 | 0.994858 |
| 4 | 429.011168 | 0.566221 | 0.987654 | 1.0 | 0.550363 | 0.746734 | 0.764706 | 0.907541 | 0.764706 | 0.907541 | 0.866847 | 0.980614 | 0.921308 | 0.991266 |
| 5 | 428.775953 | 0.761397 | 0.987654 | 1.0 | 0.490489 | 0.765591 | 0.769231 | 0.921363 | 0.769231 | 0.921363 | 0.862346 | 0.980946 | 0.922575 | 0.992565 |
| 6 | 427.760647 | 0.626463 | 0.975610 | 1.0 | 0.576920 | 0.751370 | 0.810651 | 0.919397 | 0.810651 | 0.919397 | 0.891718 | 0.973633 | 0.938957 | 0.989213 |
| 7 | 427.371011 | 0.630825 | 1.000000 | 1.0 | 0.548475 | 0.845862 | 0.781065 | 0.944954 | 0.781065 | 0.944954 | 0.858956 | 0.986040 | 0.921468 | 0.993728 |
| 8 | 248.919521 | 0.437783 | 0.962963 | 1.0 | 0.553946 | 0.775376 | 0.781065 | 0.925950 | 0.781065 | 0.925950 | 0.883540 | 0.983413 | 0.938885 | 0.992584 |
| 9 | 247.228817 | 0.367754 | 0.975309 | 1.0 | 0.492028 | 0.804770 | 0.781065 | 0.937746 | 0.781065 | 0.937746 | 0.873871 | 0.982015 | 0.936778 | 0.993184 |
# Make predictions with the Stacking Classifier on the entire dataset
preds_stacking3 = cross_val_predict(stclf_pipeline3, X, Y, cv=10, verbose=3, n_jobs=-1)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 14.4min remaining: 6.2min [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 18.5min finished
# Plot Confusion Matrix
plot_confusion_matrix(Y, preds_stacking3, target_names=cancers[:9],
title='Confusion Matrix')
# Plot sensitivities
plot_sensitivities(preds_stacking3, Y, title='Sensitivity per Cancer Type')
cSamples = sum((preds_stacking3 != 9) & (Y != 9))
ccSamples = sum(preds_stacking3 != 9)
tot = sum(Y != 9)
print(f'Cancer samples correctly classified (sensitivity): {cSamples}, out of {tot} ({cSamples/tot:.1%})')
print(f'Total precision on cancer samples: {cSamples/ccSamples:.1%}')
Cancer samples correctly classified (sensitivity): 864, out of 883 (97.8%) Total precision on cancer samples: 98.2%
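The sensitivity and precision printed above collapse the nine classes into a binary cancer-vs-healthy decision: a cancer sample counts as detected if any tumor type is predicted for it, even the wrong one. A small sketch with hypothetical labels (class 9 encoding "Healthy", as in the notebook):

```python
import numpy as np

# Hypothetical labels; class 9 stands for "Healthy", every other class is a cancer type.
y_true = np.array([0, 1, 9, 9, 2, 9, 3, 0])
y_pred = np.array([0, 9, 9, 0, 2, 9, 1, 0])

true_cancer = y_true != 9   # actual cancer samples
pred_cancer = y_pred != 9   # samples called cancer (any tumor class)
hits = (true_cancer & pred_cancer).sum()  # cancer samples flagged as cancer

sensitivity = hits / true_cancer.sum()  # detected cancers / all cancers
precision = hits / pred_cancer.sum()    # detected cancers / all cancer calls
print(f'{sensitivity:.1%}, {precision:.1%}')  # → 80.0%, 80.0%
```

This is exactly the computation done with `preds_stacking3` above, just on toy data.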
print(classification_report(Y, preds_stacking3, target_names=cancers[:9]))
precision recall f1-score support
Colorectum 0.67 0.77 0.72 346
Lung 0.49 0.45 0.47 94
Breast 0.50 0.55 0.52 174
Pancreas 0.70 0.67 0.68 82
Ovary 0.78 0.79 0.78 48
Esophagus 0.26 0.12 0.17 41
Liver 0.39 0.32 0.35 38
Stomach 0.26 0.12 0.16 60
Healthy 0.98 0.98 0.98 812
accuracy 0.78 1695
macro avg 0.56 0.53 0.54 1695
weighted avg 0.76 0.78 0.77 1695
# Print performance
performance_stacking3 = cv_score_summary(cvScores_stacking3)
display(performance_stacking3)
# Calculate AUC with standard deviations
med = performance_stacking3.loc['AUC (mean)', 'Scores']
std = performance_stacking3.loc['AUC (mean)', 'Std']
print("{:.1%} <= AUC <= {:.1%}".format(med-std, med+std))
| | Scores | Std |
|---|---|---|
| Specificity (med) | 0.9816 | 0.0137 |
| Sensitivity (med) | 0.5494 | 0.0352 |
| Sensitivity weighted (med) | 0.7758 | 0.0352 |
| AUC (med) | 0.8649 | 0.0151 |
| Specificity (mean) | 0.9827 | 0.0137 |
| Sensitivity (mean) | 0.5308 | 0.0352 |
| Sensitivity weighted (mean) | 0.7776 | 0.0352 |
| AUC (mean) | 0.8678 | 0.0151 |
85.3% <= AUC <= 88.3%
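The printed interval is simply the mean of the per-fold test AUCs plus or minus one standard deviation. Reproducing it from the `test_roc_auc` column of the table above (population standard deviation, `ddof=0`, is assumed):

```python
import numpy as np

# Per-fold test AUCs copied from the cross-validation table above.
test_auc = np.array([0.862939, 0.881280, 0.835271, 0.860921, 0.866847,
                     0.862346, 0.891718, 0.858956, 0.883540, 0.873871])
mean, std = test_auc.mean(), test_auc.std()  # np.std defaults to ddof=0
interval = "{:.1%} <= AUC <= {:.1%}".format(mean - std, mean + std)
print(interval)  # → 85.3% <= AUC <= 88.3%
```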
Next, use XGBoost with default hyperparameters as the final estimator in the stacking classifier. Keeping the defaults reduces the risk of overfitting to this particular dataset, so the model is more likely to perform similarly on an independent dataset.
%%time
# Select predictors and target variable
X = sh_data[numerical_features]
Y = sh_data['Tumor type']
# Split into train and test sets
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2, random_state=89, stratify=Y)
# Create a final meta estimator for the stacking classifier
meta_estimator = xgb.XGBClassifier()
# Create Stacking Classifier with four estimators
stclf4 = ensemble.StackingClassifier(estimators=list(to_stack.items()),
final_estimator=meta_estimator, passthrough=True,
cv=10, n_jobs=-1)
# Create pipeline
stclf_pipeline4 = Pipeline(steps=[('numerical_pipeline', numerical_pipeline),
('stacking_clf', stclf4)], verbose=3)
stclf_pipeline4.fit(trainX, trainY)
# Cross validate above parameter tuning
cvScores_stacking4 = crossVal(stclf_pipeline4, X, Y, cv_folds=10)
[Pipeline] (step 1 of 2) Processing numerical_pipeline, total= 0.0s [Pipeline] ...... (step 2 of 2) Processing stacking_clf, total= 1.8min
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 14.1min remaining: 6.1min [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 18.2min finished
Model report Cross Validated scores Sensitivity weighted (test): 0.7776 Sensitivity (test): 0.5368 AUC (train): 0.98 AUC (test): 0.8707 CPU times: user 9.29 s, sys: 413 ms, total: 9.7 s Wall time: 19min 59s
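The train/test splits above pass `stratify=Y` so every tumor class keeps its proportion in both partitions, which matters with classes as small as Liver (38 samples). A small illustration with toy imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)  # imbalanced toy labels
_, _, _, y_test = train_test_split(X, y, test_size=0.2,
                                   random_state=0, stratify=y)
# stratify=y preserves the 80/20 class ratio in the 20-sample test split
print(np.bincount(y_test).tolist())  # → [16, 4]
```

Without `stratify`, a random 20% split could easily leave a rare class under-represented in the test set.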
pd.DataFrame(cvScores_stacking4)
| | fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 420.092325 | 0.712415 | 1.000000 | 1.0 | 0.531261 | 0.729479 | 0.788235 | 0.910820 | 0.788235 | 0.910820 | 0.865081 | 0.965202 | 0.932656 | 0.986209 |
| 1 | 419.955383 | 0.664288 | 0.975309 | 1.0 | 0.544126 | 0.802306 | 0.758824 | 0.928525 | 0.758824 | 0.928525 | 0.887889 | 0.983471 | 0.925915 | 0.992933 |
| 2 | 419.776523 | 0.615159 | 1.000000 | 1.0 | 0.510489 | 0.800810 | 0.782353 | 0.933770 | 0.782353 | 0.933770 | 0.854325 | 0.987663 | 0.916148 | 0.994817 |
| 3 | 419.909699 | 0.698314 | 0.962963 | 1.0 | 0.538438 | 0.773314 | 0.776471 | 0.931148 | 0.776471 | 0.931148 | 0.850304 | 0.987543 | 0.919608 | 0.994242 |
| 4 | 416.176745 | 0.644513 | 0.987654 | 1.0 | 0.534304 | 0.764550 | 0.735294 | 0.922623 | 0.735294 | 0.922623 | 0.858879 | 0.984090 | 0.915738 | 0.993106 |
| 5 | 415.443561 | 0.653398 | 0.987654 | 1.0 | 0.527526 | 0.787362 | 0.775148 | 0.930537 | 0.775148 | 0.930537 | 0.873935 | 0.985902 | 0.928482 | 0.994428 |
| 6 | 415.306687 | 0.627952 | 0.975610 | 1.0 | 0.572199 | 0.756182 | 0.792899 | 0.921363 | 0.792899 | 0.921363 | 0.885747 | 0.978589 | 0.933993 | 0.991736 |
| 7 | 414.727284 | 0.710377 | 1.000000 | 1.0 | 0.521151 | 0.848974 | 0.763314 | 0.944954 | 0.763314 | 0.944954 | 0.856377 | 0.986111 | 0.919982 | 0.994132 |
| 8 | 239.911649 | 0.373926 | 0.987654 | 1.0 | 0.523102 | 0.784103 | 0.798817 | 0.926606 | 0.798817 | 0.926606 | 0.880312 | 0.985313 | 0.941330 | 0.993314 |
| 9 | 237.306489 | 0.327178 | 0.975309 | 1.0 | 0.565357 | 0.784417 | 0.804734 | 0.931848 | 0.804734 | 0.931848 | 0.894192 | 0.984207 | 0.943744 | 0.993892 |
# Make predictions with the Stacking Classifier on the entire dataset
preds_stacking4 = cross_val_predict(stclf_pipeline4, X, Y, cv=10, verbose=3, n_jobs=-1)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 19.0min remaining: 8.1min [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 24.3min finished
# Plot Confusion Matrix
plot_confusion_matrix(Y, preds_stacking4, target_names=cancers[:9],
title='Confusion Matrix')
# Plot sensitivities
plot_sensitivities(preds_stacking4, Y, title='Sensitivity per Cancer Type')
cSamples = sum((preds_stacking4 != 9) & (Y != 9))
ccSamples = sum(preds_stacking4 != 9)
tot = sum(Y != 9)
print(f'Cancer samples correctly classified (sensitivity): {cSamples}, out of {tot} ({cSamples/tot:.1%})')
print(f'Total precision on cancer samples: {cSamples/ccSamples:.1%}')
Cancer samples correctly classified (sensitivity): 864, out of 883 (97.8%) Total precision on cancer samples: 98.3%
print(classification_report(Y, preds_stacking4, target_names=cancers[:9]))
precision recall f1-score support
Colorectum 0.67 0.75 0.71 346
Lung 0.50 0.47 0.48 94
Breast 0.50 0.54 0.52 174
Pancreas 0.71 0.65 0.68 82
Ovary 0.78 0.83 0.81 48
Esophagus 0.24 0.15 0.18 41
Liver 0.42 0.34 0.38 38
Stomach 0.22 0.13 0.17 60
Healthy 0.98 0.98 0.98 812
accuracy 0.78 1695
macro avg 0.56 0.54 0.54 1695
weighted avg 0.76 0.78 0.77 1695
# Print performance
performance_stacking4 = cv_score_summary(cvScores_stacking4)
display(performance_stacking4)
# Calculate AUC with standard deviations
med = performance_stacking4.loc['AUC (mean)', 'Scores']
std = performance_stacking4.loc['AUC (mean)', 'Std']
print("{:.1%} <= AUC <= {:.1%}".format(med-std, med+std))
| | Scores | Std |
|---|---|---|
| Specificity (med) | 0.9877 | 0.0121 |
| Sensitivity (med) | 0.5328 | 0.0184 |
| Sensitivity weighted (med) | 0.7794 | 0.0184 |
| AUC (med) | 0.8695 | 0.0150 |
| Specificity (mean) | 0.9852 | 0.0121 |
| Sensitivity (mean) | 0.5368 | 0.0184 |
| Sensitivity weighted (mean) | 0.7776 | 0.0184 |
| AUC (mean) | 0.8707 | 0.0150 |
85.6% <= AUC <= 88.6%
%%time
# Select predictors and target variable
X = sh_data[numerical_features]
Y = sh_data['Tumor type']
# Split into train and test sets
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2, random_state=89, stratify=Y)
# Create a final meta estimator for the stacking classifier
meta_estimator = CatBoostClassifier(learning_rate=0.1,
n_estimators=500,
max_depth=3,
eval_metric="MultiClass",
bootstrap_type="Bernoulli",
silent=True)
# Create Stacking Classifier with four estimators
stclf5 = ensemble.StackingClassifier(estimators=list(to_stack.items()),
final_estimator=meta_estimator, passthrough=True,
cv=10, n_jobs=-1)
# Create pipeline
stclf_pipeline5 = Pipeline(steps=[('numerical_pipeline', numerical_pipeline),
('stacking_clf', stclf5)], verbose=3)
stclf_pipeline5.fit(trainX, trainY)
# Cross validate above parameter tuning
cvScores_stacking5 = crossVal(stclf_pipeline5, X, Y, cv_folds=10)
[Pipeline] (step 1 of 2) Processing numerical_pipeline, total= 0.0s [Pipeline] ...... (step 2 of 2) Processing stacking_clf, total= 2.4min
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 17.9min remaining: 7.7min [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 23.9min finished
Model report Cross Validated scores Sensitivity weighted (test): 0.7941 Sensitivity (test): 0.5625 AUC (train): 0.98 AUC (test): 0.8849 CPU times: user 12.6 s, sys: 502 ms, total: 13.1 s Wall time: 26min 19s
pd.DataFrame(cvScores_stacking5)
| | fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 534.615660 | 0.606736 | 1.000000 | 1.0 | 0.506746 | 0.769376 | 0.805882 | 0.933115 | 0.805882 | 0.933115 | 0.863176 | 0.978969 | 0.930975 | 0.992000 |
| 1 | 535.728437 | 0.679588 | 0.975309 | 1.0 | 0.599858 | 0.728564 | 0.788235 | 0.919344 | 0.788235 | 0.919344 | 0.897124 | 0.967170 | 0.934120 | 0.986577 |
| 2 | 535.419879 | 0.768534 | 1.000000 | 1.0 | 0.556730 | 0.785216 | 0.794118 | 0.937049 | 0.794118 | 0.937049 | 0.863486 | 0.984175 | 0.921097 | 0.993380 |
| 3 | 534.015946 | 0.667283 | 0.962963 | 1.0 | 0.527724 | 0.779395 | 0.782353 | 0.931803 | 0.782353 | 0.931803 | 0.878962 | 0.988198 | 0.931559 | 0.993791 |
| 4 | 521.410073 | 0.964215 | 0.987654 | 1.0 | 0.645259 | 0.773480 | 0.800000 | 0.929836 | 0.800000 | 0.929836 | 0.889348 | 0.983781 | 0.933698 | 0.992845 |
| 5 | 521.468561 | 1.146465 | 1.000000 | 1.0 | 0.535777 | 0.745180 | 0.792899 | 0.926606 | 0.792899 | 0.926606 | 0.883115 | 0.976641 | 0.935999 | 0.989425 |
| 6 | 520.442214 | 1.072275 | 0.963415 | 1.0 | 0.589726 | 0.756470 | 0.792899 | 0.919397 | 0.792899 | 0.919397 | 0.903173 | 0.973714 | 0.944413 | 0.988372 |
| 7 | 520.481682 | 1.163096 | 1.000000 | 1.0 | 0.544662 | 0.783284 | 0.769231 | 0.936435 | 0.769231 | 0.936435 | 0.883601 | 0.989773 | 0.929802 | 0.994783 |
| 8 | 357.724233 | 0.390601 | 0.962963 | 1.0 | 0.580543 | 0.798394 | 0.804734 | 0.935125 | 0.804734 | 0.935125 | 0.898698 | 0.984561 | 0.946151 | 0.992702 |
| 9 | 356.160422 | 0.519151 | 0.975309 | 1.0 | 0.537761 | 0.766378 | 0.810651 | 0.930537 | 0.810651 | 0.930537 | 0.888098 | 0.983658 | 0.944119 | 0.993208 |
# Make predictions with the Stacking Classifier on the entire dataset
preds_stacking5 = cross_val_predict(stclf_pipeline5, X, Y, cv=10, verbose=3, n_jobs=-1)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 16.2min remaining: 7.0min [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 20.2min finished
# Plot Confusion Matrix
plot_confusion_matrix(Y, preds_stacking5, target_names=cancers[:9],
title='Confusion Matrix')
# Plot sensitivities
plot_sensitivities(preds_stacking5, Y, title='Sensitivity per Cancer Type')
cSamples = sum((preds_stacking5 != 9) & (Y != 9))
ccSamples = sum(preds_stacking5 != 9)
tot = sum(Y != 9)
print(f'Cancer samples correctly classified (sensitivity): {cSamples}, out of {tot} ({cSamples/tot:.1%})')
print(f'Total precision on cancer samples: {cSamples/ccSamples:.1%}')
Cancer samples correctly classified (sensitivity): 867, out of 883 (98.2%) Total precision on cancer samples: 98.5%
print(classification_report(Y, preds_stacking5, target_names=cancers[:9]))
precision recall f1-score support
Colorectum 0.69 0.79 0.74 346
Lung 0.46 0.45 0.45 94
Breast 0.51 0.56 0.53 174
Pancreas 0.68 0.67 0.67 82
Ovary 0.82 0.83 0.82 48
Esophagus 0.39 0.17 0.24 41
Liver 0.44 0.37 0.40 38
Stomach 0.33 0.12 0.17 60
Healthy 0.98 0.98 0.98 812
accuracy 0.79 1695
macro avg 0.59 0.55 0.56 1695
weighted avg 0.77 0.79 0.78 1695
# Print performance
performance_stacking5 = cv_score_summary(cvScores_stacking5)
display(performance_stacking5)
# Calculate AUC with standard deviations
med = performance_stacking5.loc['AUC (mean)', 'Scores']
std = performance_stacking5.loc['AUC (mean)', 'Std']
print("{:.1%} <= AUC <= {:.1%}".format(med-std, med+std))
| | Scores | Std |
|---|---|---|
| Specificity (med) | 0.9815 | 0.0158 |
| Sensitivity (med) | 0.5507 | 0.0391 |
| Sensitivity weighted (med) | 0.7935 | 0.0391 |
| AUC (med) | 0.8858 | 0.0129 |
| Specificity (mean) | 0.9828 | 0.0158 |
| Sensitivity (mean) | 0.5625 | 0.0391 |
| Sensitivity weighted (mean) | 0.7941 | 0.0391 |
| AUC (mean) | 0.8849 | 0.0129 |
87.2% <= AUC <= 89.8%
# Select the single Aneuploidy feature
aneuploidy = ['Aneuploidy']
# Define the steps in the numerical pipeline
aneuploidy_pipeline = Pipeline(steps=[('numerical_selector', FeatureSelector(aneuploidy)),
('PercentileTransformer', PercentileTransformer(percentile=0.90,
healthy_class=9)),
('StandardScaler', StandardScaler())])
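`FeatureSelector` and `PercentileTransformer` are custom transformers defined earlier in the notebook. For reference, a minimal column selector compatible with this pipeline could look like the following sketch (the real implementation may differ):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    """Sketch of a DataFrame column selector usable as a Pipeline step."""
    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return X[self.feature_names]  # keep only the requested columns

df = pd.DataFrame({'Aneuploidy': [0.17, 0.25], 'Mutation': [2.9, 1.1]})
print(FeatureSelector(['Aneuploidy']).transform(df).columns.tolist())  # → ['Aneuploidy']
```

Deriving from `BaseEstimator` and `TransformerMixin` is what lets the selector participate in `Pipeline` cloning and `get_params`/`set_params`.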
%%time
# Select predictors and target variable
Xa = sh_data[aneuploidy]
Ya = sh_data['Tumor type']
# Split into train and test sets
trainXa, testXa, trainYa, testYa = train_test_split(Xa, Ya, test_size=0.2, random_state=89, stratify=Ya)
# Create classifiers
logReg = LogisticRegression(max_iter=1000)
knn = KNeighborsClassifier()
svc = SVC(probability=True)
rf = ensemble.RandomForestClassifier()
gb = ensemble.GradientBoostingClassifier(learning_rate=0.1)
ct = CatBoostClassifier(learning_rate=.1,
eval_metric='MultiClass',
bootstrap_type='Bernoulli',
silent=True)
lgbm = lgb.LGBMClassifier(learning_rate=0.1,
objective='multiclass')
xgboost = xgb.XGBClassifier(learning_rate=0.1,
max_depth=4,
min_child_weight=1,
gamma=0,
subsample=.8,
colsample_bytree=.8,
scale_pos_weight=1,
booster='gbtree',
eval_metric='merror',
objective='multi:softprob',
seed=29)
# Specify classifier names and add them in a list
names = ['LogisticRegression', 'KNN', 'SVC', 'RandomForest',
'GradientBoosting', 'CatBoost', 'LightGBM', 'XGBoost']
classifiers = [logReg, knn, svc, rf, gb, ct, lgbm, xgboost]
# Specify hyper parameters to tune for each classifier
parameters = [{'clf__C': [0.1, 1, 10, 50, 100]},
{'clf__n_neighbors': [3, 4, 5, 6, 7, 8, 9, 10, 11],
'clf__weights': ['uniform', 'distance'],
'clf__leaf_size': [2, 3, 4, 5, 6, 8, 10, 20],
'clf__p': [1, 2, 3]},
{'clf__C': [1, 10, 50, 100],
'clf__kernel': ['linear', 'rbf']},
{'clf__max_depth': [4, 5, 6],
'clf__n_estimators': [300, 400, 500, 600],
'clf__max_samples': [0.5, 0.7, 0.9, 1],
'clf__max_features': [0.25, 0.5, 0.75, 1]},
{'clf__n_estimators': [400, 500, 600],
'clf__max_depth': [3, 4, 5]},
{'clf__max_depth': [3, 4, 5],
'clf__n_estimators': [300, 400, 500, 600]},
{'clf__max_depth': [3, 4, 5],
'clf__n_estimators': [400, 500, 600],
'clf__num_leaves': [8, 16, 32, 64]},
{'clf__n_estimators': [200, 300, 400, 500],
'clf__max_depth': [3, 4, 5],
'clf__colsample_bytree': [.5, .75, 1.]}]
# Create dictionaries to store the results
crossVal_scores_A = {}; best_models_A = {}; predictions_A = {}
# Train and evaluate a number of estimators in a pipeline
for name, classifier, params in zip(names, classifiers, parameters):
print('\n\n============================================================================')
print(f'================================== {name} ==================================')
print('============================================================================')
    # Create pipeline with the Aneuploidy feature and an estimator
clf_pipeline = Pipeline(steps=[('aneuploidy_pipeline', aneuploidy_pipeline),
('clf', classifier)])
# GridSearch
gs_clf = GridSearchCV(clf_pipeline, params,
scoring={'recall': 'recall_weighted'},
refit='recall', cv=10, n_jobs=4, verbose=3)
gs_clf.fit(trainXa, trainYa)
# Cross validate above parameter tuning
cv_scores = crossVal(gs_clf, Xa, Ya, nestedCV=False, cv_folds=10)
# Pause for slimmer printouts
time.sleep(1)
# Select the best model and make predictions on the entire dataset using the pipeline
model = gs_clf.best_estimator_['clf']
clf_pipeline = Pipeline(steps=[('aneuploidy_pipeline', aneuploidy_pipeline),
('model', model)],
verbose=1)
preds = cross_val_predict(clf_pipeline, Xa, Ya, cv=10, verbose=1, n_jobs=4)
# Save results to dictionaries
crossVal_scores_A[name] = cv_scores
best_models_A[name] = model
predictions_A[name] = preds
============================================================================ ================================== LogisticRegression ================================== ============================================================================ Fitting 10 folds for each of 5 candidates, totalling 50 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 24 tasks | elapsed: 3.5s [Parallel(n_jobs=4)]: Done 50 out of 50 | elapsed: 4.5s finished [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 1.4s remaining: 0.6s [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 1.6s finished
Model report
Best parameters: {'clf__C': 50}
Best score: 0.6187418300653594
Cross Validated scores
Sensitivity weighted (test): 0.6148
Sensitivity (test): 0.1886
AUC (train): 0.62
AUC (test): 0.6147
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 10 out of 10 | elapsed: 0.3s finished [Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
============================================================================ ================================== KNN ================================== ============================================================================ Fitting 10 folds for each of 432 candidates, totalling 4320 fits
[Parallel(n_jobs=4)]: Done 56 tasks | elapsed: 0.8s [Parallel(n_jobs=4)]: Done 440 tasks | elapsed: 5.2s [Parallel(n_jobs=4)]: Done 1080 tasks | elapsed: 14.5s [Parallel(n_jobs=4)]: Done 1976 tasks | elapsed: 26.0s [Parallel(n_jobs=4)]: Done 3128 tasks | elapsed: 39.8s [Parallel(n_jobs=4)]: Done 4320 out of 4320 | elapsed: 51.4s finished [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 1.2s remaining: 0.5s [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 1.5s finished
Model report
Best parameters: {'clf__leaf_size': 2, 'clf__n_neighbors': 10, 'clf__p': 1, 'clf__weights': 'uniform'}
Best score: 0.6099019607843139
Cross Validated scores
Sensitivity weighted (test): 0.6083
Sensitivity (test): 0.2092
AUC (train): 0.79
AUC (test): 0.5975
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 10 out of 10 | elapsed: 0.2s finished
============================================================================ ================================== SVC ================================== ============================================================================ Fitting 10 folds for each of 8 candidates, totalling 80 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 39 tasks | elapsed: 5.4s [Parallel(n_jobs=4)]: Done 80 out of 80 | elapsed: 21.9s finished [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 2.5s remaining: 1.1s [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 3.2s finished
Model report
Best parameters: {'clf__C': 1, 'clf__kernel': 'rbf'}
Best score: 0.6246459694989107
Cross Validated scores
Sensitivity weighted (test): 0.6230
Sensitivity (test): 0.1946
AUC (train): 0.61
AUC (test): 0.6035
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 10 out of 10 | elapsed: 1.1s finished [Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
============================================================================ ================================== RandomForest ================================== ============================================================================ Fitting 10 folds for each of 192 candidates, totalling 1920 fits
[Parallel(n_jobs=4)]: Done 24 tasks | elapsed: 8.4s [Parallel(n_jobs=4)]: Done 120 tasks | elapsed: 50.9s [Parallel(n_jobs=4)]: Done 280 tasks | elapsed: 2.2min [Parallel(n_jobs=4)]: Done 504 tasks | elapsed: 4.3min [Parallel(n_jobs=4)]: Done 792 tasks | elapsed: 6.7min [Parallel(n_jobs=4)]: Done 1144 tasks | elapsed: 9.6min [Parallel(n_jobs=4)]: Done 1560 tasks | elapsed: 12.8min [Parallel(n_jobs=4)]: Done 1920 out of 1920 | elapsed: 15.6min finished [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 3.8s remaining: 1.6s [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 5.1s finished
Model report
Best parameters: {'clf__max_depth': 4, 'clf__max_features': 1, 'clf__max_samples': 0.7, 'clf__n_estimators': 300}
Best score: 0.6239106753812637
Cross Validated scores
Sensitivity weighted (test): 0.6201
Sensitivity (test): 0.1981
AUC (train): 0.72
AUC (test): 0.6250
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 10 out of 10 | elapsed: 2.9s finished [Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
============================================================================ ================================== GradientBoosting ================================== ============================================================================ Fitting 10 folds for each of 9 candidates, totalling 90 fits
[Parallel(n_jobs=4)]: Done 24 tasks | elapsed: 1.0min [Parallel(n_jobs=4)]: Done 90 out of 90 | elapsed: 5.1min finished [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 30.9s remaining: 13.2s [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 38.0s finished
Model report
Best parameters: {'clf__max_depth': 4, 'clf__n_estimators': 400}
Best score: 0.5324618736383443
Cross Validated scores
Sensitivity weighted (test): 0.5463
Sensitivity (test): 0.2010
AUC (train): 0.98
AUC (test): 0.5330
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 10 out of 10 | elapsed: 42.4s finished [Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
============================================================================ ================================== CatBoost ================================== ============================================================================ Fitting 10 folds for each of 12 candidates, totalling 120 fits
[Parallel(n_jobs=4)]: Done 24 tasks | elapsed: 13.9s [Parallel(n_jobs=4)]: Done 120 out of 120 | elapsed: 1.6min finished [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 6.7s remaining: 2.9s [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 7.7s finished
Model report
Best parameters: {'clf__max_depth': 3, 'clf__n_estimators': 300}
Best score: 0.6216884531590414
Cross Validated scores
Sensitivity weighted (test): 0.6195
Sensitivity (test): 0.2083
AUC (train): 0.74
AUC (test): 0.6291
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 10 out of 10 | elapsed: 7.0s finished [Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
============================================================================ ================================== LightGBM ================================== ============================================================================ Fitting 10 folds for each of 36 candidates, totalling 360 fits
[Parallel(n_jobs=4)]: Done 24 tasks | elapsed: 12.7s [Parallel(n_jobs=4)]: Done 120 tasks | elapsed: 1.2min [Parallel(n_jobs=4)]: Done 280 tasks | elapsed: 2.9min A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak. [Parallel(n_jobs=4)]: Done 360 out of 360 | elapsed: 4.0min finished [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 12.7s remaining: 5.4s [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 16.1s finished
Model report
Best parameters: {'clf__max_depth': 3, 'clf__n_estimators': 500, 'clf__num_leaves': 8}
Best score: 0.5929411764705883
Cross Validated scores
Sensitivity weighted (test): 0.6018
Sensitivity (test): 0.2155
AUC (train): 0.85
AUC (test): 0.6130
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 10 out of 10 | elapsed: 5.3s finished [Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
============================================================================ ================================== XGBoost ================================== ============================================================================ Fitting 10 folds for each of 36 candidates, totalling 360 fits
[Parallel(n_jobs=4)]: Done 24 tasks | elapsed: 13.7s [Parallel(n_jobs=4)]: Done 120 tasks | elapsed: 1.3min [Parallel(n_jobs=4)]: Done 280 tasks | elapsed: 2.8min [Parallel(n_jobs=4)]: Done 360 out of 360 | elapsed: 3.6min finished [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 5.7s remaining: 2.4s [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 8.5s finished
Model report
Best parameters: {'clf__colsample_bytree': 0.5, 'clf__max_depth': 3, 'clf__n_estimators': 200}
Best score: 0.5803867102396514
Cross Validated scores
Sensitivity weighted (test): 0.5964
Sensitivity (test): 0.2065
AUC (train): 0.85
AUC (test): 0.6195
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
CPU times: user 1min 11s, sys: 2.12 s, total: 1min 13s Wall time: 34min 8s
[Parallel(n_jobs=4)]: Done 10 out of 10 | elapsed: 4.0s finished
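The `clf__` prefixes in the parameter grids above follow scikit-learn's convention for addressing a step's parameters inside a `Pipeline`: step name, double underscore, parameter name. A minimal sketch:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('clf', LogisticRegression())])
# GridSearchCV resolves 'clf__C' to the C parameter of the 'clf' step,
# exactly as set_params does here.
pipe.set_params(clf__C=10)
print(pipe.get_params()['clf__C'])  # → 10
```

This is why each grid key above starts with `clf__`: the estimators are wrapped in a pipeline whose final step is named `clf`.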
Plot some statistics for the models above. For some reason, the SHAP and feature-importance plots must be generated separately; plotting them together produces inconsistent output.
for name in names:
print('\n\n============================================================================')
print(f'================================== {name} ==================================')
print('============================================================================')
# Plot Confusion Matrix
    plot_confusion_matrix(Ya, predictions_A[name], target_names=cancers[:9],
title='Confusion Matrix')
# Plot sensitivities
plot_sensitivities(predictions_A[name], Ya, title='Sensitivity per Cancer Type')
# Print the fraction of cancer/healthy samples classified
if name == 'CatBoost':
        # CatBoost's predict returns a 2-D column array; flatten it to 1-D
        predsCat = np.array([i[0] for i in predictions_A[name]])
cSamples = sum((predsCat != 9) & (Ya != 9))
ccSamples = sum(predsCat != 9)
else:
cSamples = sum((predictions_A[name] != 9) & (Ya != 9))
ccSamples = sum(predictions_A[name] != 9)
tot = sum(Ya != 9)
print(f'Cancer samples correctly classified (sensitivity): {cSamples}, out of {tot} ({cSamples/tot:.1%})')
print(f'Total precision on cancer samples: {cSamples/ccSamples:.1%}')
# Print cross-validation scores
display(pd.DataFrame(crossVal_scores_A[name]))
    # Print classification report
print(classification_report(Ya, predictions_A[name], target_names=cancers[:9]))
# Print performance
performance = cv_score_summary(crossVal_scores_A[name])
display(performance)
# Calculate AUC with standard deviations
med = performance.loc['AUC (mean)', 'Scores']
std = performance.loc['AUC (mean)', 'Std']
print("{:.1%} <= AUC <= {:.1%}".format(med-std, med+std))
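The CatBoost-specific branch above is needed because `CatBoostClassifier.predict` returns a 2-D column array rather than a flat vector, so the element-wise comparisons would broadcast incorrectly. The list comprehension is equivalent to flattening:

```python
import numpy as np

# Hypothetical column-shaped predictions, as CatBoost's predict returns them.
preds = np.array([[9], [0], [9], [3]])
flat = np.array([i[0] for i in preds])  # same result as preds.ravel()
print(flat.tolist())  # → [9, 0, 9, 3]
assert np.array_equal(flat, preds.ravel())
```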
============================================================================ ================================== LogisticRegression ================================== ============================================================================
Cancer samples correctly classified (sensitivity): 583, out of 883 (66.0%) Total precision on cancer samples: 96.7%
| | fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.162545 | 0.252167 | 0.975309 | 0.975376 | 0.184558 | 0.189118 | 0.605882 | 0.615738 | 0.605882 | 0.615738 | 0.599469 | 0.622462 | 0.717695 | 0.744923 |
| 1 | 0.080934 | 0.176546 | 0.987654 | 0.974008 | 0.198628 | 0.187537 | 0.635294 | 0.612459 | 0.635294 | 0.612459 | 0.618396 | 0.618489 | 0.747627 | 0.740878 |
| 2 | 0.077889 | 0.198772 | 0.962963 | 0.976744 | 0.199059 | 0.187841 | 0.629412 | 0.613770 | 0.629412 | 0.613770 | 0.558765 | 0.629399 | 0.707135 | 0.746147 |
| 3 | 0.067422 | 0.205836 | 0.962963 | 0.976744 | 0.167313 | 0.192486 | 0.570588 | 0.622295 | 0.570588 | 0.622295 | 0.586821 | 0.626690 | 0.705415 | 0.746034 |
| 4 | 0.078490 | 0.212536 | 0.962963 | 0.976744 | 0.195885 | 0.188198 | 0.623529 | 0.614426 | 0.623529 | 0.614426 | 0.624922 | 0.607820 | 0.745848 | 0.737142 |
| 5 | 0.071968 | 0.217197 | 0.950617 | 0.978112 | 0.188164 | 0.189780 | 0.609467 | 0.617300 | 0.609467 | 0.617300 | 0.609759 | 0.620772 | 0.720880 | 0.743959 |
| 6 | 0.095520 | 0.195774 | 0.975610 | 0.975342 | 0.183564 | 0.189212 | 0.609467 | 0.615334 | 0.609467 | 0.615334 | 0.633922 | 0.618574 | 0.763536 | 0.739564 |
| 7 | 0.097104 | 0.167899 | 0.987805 | 0.973973 | 0.194723 | 0.187991 | 0.633136 | 0.612713 | 0.633136 | 0.612713 | 0.639688 | 0.615974 | 0.766959 | 0.738624 |
| 8 | 0.071268 | 0.090594 | 0.987654 | 0.974008 | 0.181635 | 0.189420 | 0.603550 | 0.615990 | 0.603550 | 0.615990 | 0.644269 | 0.622013 | 0.757580 | 0.744826 |
| 9 | 0.058165 | 0.083419 | 1.000000 | 0.972640 | 0.192810 | 0.188199 | 0.627219 | 0.613368 | 0.627219 | 0.613368 | 0.630613 | 0.617442 | 0.757727 | 0.740444 |
precision recall f1-score support
Colorectum 0.41 0.72 0.53 346
Lung 0.00 0.00 0.00 94
Breast 0.00 0.00 0.00 174
Pancreas 0.00 0.00 0.00 82
Ovary 0.00 0.00 0.00 48
Esophagus 0.00 0.00 0.00 41
Liver 0.00 0.00 0.00 38
Stomach 0.00 0.00 0.00 60
Healthy 0.73 0.98 0.83 812
accuracy 0.61 1695
macro avg 0.13 0.19 0.15 1695
weighted avg 0.43 0.61 0.51 1695
Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9755 | 0.0146 |
| Sensitivity (med) | 0.1905 | 0.0093 |
| Sensitivity weighted (med) | 0.6165 | 0.0093 |
| AUC (med) | 0.6217 | 0.0253 |
| Specificity (mean) | 0.9754 | 0.0146 |
| Sensitivity (mean) | 0.1886 | 0.0093 |
| Sensitivity weighted (mean) | 0.6148 | 0.0093 |
| AUC (mean) | 0.6147 | 0.0253 |
58.9% <= AUC <= 64.0% ============================================================================ ================================== KNN ================================== ============================================================================
Cancer samples correctly classified (sensitivity): 690, out of 883 (78.1%) Total precision on cancer samples: 94.0%
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.017422 | 0.214847 | 0.950617 | 0.960328 | 0.233622 | 0.238493 | 0.629412 | 0.640656 | 0.629412 | 0.640656 | 0.556592 | 0.796951 | 0.654423 | 0.835490 |
| 1 | 0.023643 | 0.217680 | 0.987654 | 0.954856 | 0.223849 | 0.241459 | 0.641176 | 0.644590 | 0.641176 | 0.644590 | 0.593987 | 0.791585 | 0.693628 | 0.833656 |
| 2 | 0.021389 | 0.213188 | 0.950617 | 0.953488 | 0.196100 | 0.239349 | 0.605882 | 0.640000 | 0.605882 | 0.640000 | 0.575940 | 0.799271 | 0.687959 | 0.837230 |
| 3 | 0.057832 | 0.220987 | 0.925926 | 0.954856 | 0.160024 | 0.247328 | 0.547059 | 0.648525 | 0.547059 | 0.648525 | 0.609627 | 0.790534 | 0.696153 | 0.833640 |
| 4 | 0.015373 | 0.193178 | 0.950617 | 0.957592 | 0.214938 | 0.240735 | 0.629412 | 0.641311 | 0.629412 | 0.641311 | 0.599311 | 0.789593 | 0.703087 | 0.831717 |
| 5 | 0.025223 | 0.173803 | 0.913580 | 0.957592 | 0.171350 | 0.250037 | 0.568047 | 0.646789 | 0.568047 | 0.646789 | 0.613850 | 0.793818 | 0.692266 | 0.838186 |
| 6 | 0.016679 | 0.175973 | 0.951220 | 0.956164 | 0.250716 | 0.241698 | 0.633136 | 0.640891 | 0.633136 | 0.640891 | 0.644546 | 0.787289 | 0.738173 | 0.829673 |
| 7 | 0.014557 | 0.169044 | 0.975610 | 0.954795 | 0.209055 | 0.239946 | 0.621302 | 0.641547 | 0.621302 | 0.641547 | 0.567032 | 0.792181 | 0.681762 | 0.831819 |
| 8 | 0.018715 | 0.117584 | 0.938272 | 0.949384 | 0.198297 | 0.243033 | 0.603550 | 0.644823 | 0.603550 | 0.644823 | 0.608440 | 0.803712 | 0.703577 | 0.841212 |
| 9 | 0.014081 | 0.101221 | 0.913580 | 0.957592 | 0.233826 | 0.227924 | 0.603550 | 0.636959 | 0.603550 | 0.636959 | 0.605862 | 0.788419 | 0.702824 | 0.832383 |
precision recall f1-score support
Colorectum 0.42 0.71 0.53 346
Lung 0.07 0.02 0.03 94
Breast 0.08 0.03 0.04 174
Pancreas 0.23 0.04 0.06 82
Ovary 0.35 0.15 0.21 48
Esophagus 0.00 0.00 0.00 41
Liver 0.00 0.00 0.00 38
Stomach 0.00 0.00 0.00 60
Healthy 0.80 0.95 0.87 812
accuracy 0.61 1695
macro avg 0.22 0.21 0.19 1695
weighted avg 0.50 0.61 0.54 1695
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9506 | 0.0230 |
| Sensitivity (med) | 0.2120 | 0.0270 |
| Sensitivity weighted (med) | 0.6136 | 0.0270 |
| AUC (med) | 0.6026 | 0.0243 |
| Specificity (mean) | 0.9458 | 0.0230 |
| Sensitivity (mean) | 0.2092 | 0.0270 |
| Sensitivity weighted (mean) | 0.6083 | 0.0270 |
| AUC (mean) | 0.5975 | 0.0243 |
57.3% <= AUC <= 62.2% ============================================================================ ================================== SVC ================================== ============================================================================
Cancer samples correctly classified (sensitivity): 626, out of 883 (70.9%) Total precision on cancer samples: 95.7%
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.487790 | 0.250156 | 0.962963 | 0.965800 | 0.192710 | 0.194842 | 0.617647 | 0.623607 | 0.617647 | 0.623607 | 0.600736 | 0.603281 | 0.708689 | 0.733771 |
| 1 | 0.508397 | 0.237883 | 0.987654 | 0.964432 | 0.204977 | 0.193261 | 0.647059 | 0.620328 | 0.647059 | 0.620328 | 0.612670 | 0.599971 | 0.738995 | 0.729501 |
| 2 | 0.488255 | 0.238152 | 0.950617 | 0.965800 | 0.200862 | 0.195200 | 0.629412 | 0.624262 | 0.629412 | 0.624262 | 0.553288 | 0.617498 | 0.706645 | 0.739976 |
| 3 | 0.443303 | 0.245244 | 0.962963 | 0.967168 | 0.173663 | 0.196781 | 0.582353 | 0.627541 | 0.582353 | 0.627541 | 0.587253 | 0.614032 | 0.711801 | 0.738577 |
| 4 | 0.439306 | 0.211026 | 0.962963 | 0.964432 | 0.199059 | 0.195048 | 0.629412 | 0.623607 | 0.629412 | 0.623607 | 0.608051 | 0.605807 | 0.737597 | 0.729872 |
| 5 | 0.412089 | 0.253386 | 0.938272 | 0.967168 | 0.193141 | 0.195709 | 0.615385 | 0.625164 | 0.615385 | 0.625164 | 0.590219 | 0.604259 | 0.706147 | 0.733876 |
| 6 | 0.414042 | 0.247927 | 0.975610 | 0.965753 | 0.196636 | 0.194201 | 0.633136 | 0.621887 | 0.633136 | 0.621887 | 0.619280 | 0.605265 | 0.748800 | 0.730769 |
| 7 | 0.389863 | 0.264375 | 0.975610 | 0.964384 | 0.199904 | 0.194404 | 0.639053 | 0.621887 | 0.639053 | 0.621887 | 0.597281 | 0.596260 | 0.739354 | 0.724793 |
| 8 | 0.418505 | 0.181978 | 0.987654 | 0.963064 | 0.191439 | 0.196038 | 0.621302 | 0.625164 | 0.621302 | 0.625164 | 0.641565 | 0.597227 | 0.756216 | 0.728241 |
| 9 | 0.349697 | 0.188138 | 0.950617 | 0.965800 | 0.193859 | 0.195630 | 0.615385 | 0.625164 | 0.615385 | 0.625164 | 0.624378 | 0.607801 | 0.753134 | 0.731540 |
precision recall f1-score support
Colorectum 0.42 0.79 0.54 346
Lung 0.00 0.00 0.00 94
Breast 0.00 0.00 0.00 174
Pancreas 0.00 0.00 0.00 82
Ovary 0.00 0.00 0.00 48
Esophagus 0.00 0.00 0.00 41
Liver 0.00 0.00 0.00 38
Stomach 0.00 0.00 0.00 60
Healthy 0.75 0.97 0.85 812
accuracy 0.62 1695
macro avg 0.13 0.19 0.15 1695
weighted avg 0.45 0.62 0.52 1695
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9630 | 0.0155 |
| Sensitivity (med) | 0.1952 | 0.0081 |
| Sensitivity weighted (med) | 0.6254 | 0.0081 |
| AUC (med) | 0.6044 | 0.0229 |
| Specificity (mean) | 0.9655 | 0.0155 |
| Sensitivity (mean) | 0.1946 | 0.0081 |
| Sensitivity weighted (mean) | 0.6230 | 0.0081 |
| AUC (mean) | 0.6035 | 0.0229 |
58.1% <= AUC <= 62.6% ============================================================================ ================================== RandomForest ================================== ============================================================================
Cancer samples correctly classified (sensitivity): 690, out of 883 (78.1%) Total precision on cancer samples: 93.8%
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.952566 | 0.275129 | 0.950617 | 0.956224 | 0.197511 | 0.208028 | 0.617647 | 0.633443 | 0.617647 | 0.633443 | 0.623578 | 0.729940 | 0.725452 | 0.803249 |
| 1 | 0.983268 | 0.269400 | 0.975309 | 0.948016 | 0.203606 | 0.205355 | 0.641176 | 0.633443 | 0.641176 | 0.633443 | 0.628756 | 0.716008 | 0.749068 | 0.793918 |
| 2 | 0.999907 | 0.290879 | 0.950617 | 0.952120 | 0.200862 | 0.206123 | 0.629412 | 0.634754 | 0.629412 | 0.634754 | 0.576796 | 0.725807 | 0.724967 | 0.798764 |
| 3 | 0.964813 | 0.289326 | 0.925926 | 0.950752 | 0.172722 | 0.210932 | 0.570588 | 0.639344 | 0.570588 | 0.639344 | 0.606720 | 0.721697 | 0.720778 | 0.799163 |
| 4 | 1.084427 | 0.407014 | 0.950617 | 0.952120 | 0.207398 | 0.208589 | 0.635294 | 0.636066 | 0.635294 | 0.636066 | 0.626033 | 0.725624 | 0.742229 | 0.798188 |
| 5 | 1.014747 | 0.398915 | 0.913580 | 0.949384 | 0.190398 | 0.206526 | 0.603550 | 0.633683 | 0.603550 | 0.633683 | 0.616046 | 0.720330 | 0.714845 | 0.801025 |
| 6 | 1.073085 | 0.379459 | 0.951220 | 0.954795 | 0.197194 | 0.201525 | 0.627219 | 0.631717 | 0.627219 | 0.631717 | 0.644094 | 0.723100 | 0.767295 | 0.796097 |
| 7 | 1.052827 | 0.380821 | 0.975610 | 0.952055 | 0.206440 | 0.208289 | 0.644970 | 0.635649 | 0.644970 | 0.635649 | 0.626067 | 0.724381 | 0.757080 | 0.795538 |
| 8 | 0.873439 | 0.190244 | 0.925926 | 0.949384 | 0.200557 | 0.206983 | 0.615385 | 0.634338 | 0.615385 | 0.634338 | 0.662684 | 0.727477 | 0.764550 | 0.802168 |
| 9 | 0.841564 | 0.189715 | 0.901235 | 0.943912 | 0.204349 | 0.210090 | 0.615385 | 0.633028 | 0.615385 | 0.633028 | 0.639497 | 0.726297 | 0.760643 | 0.800089 |
precision recall f1-score support
Colorectum 0.40 0.82 0.54 346
Lung 0.00 0.00 0.00 94
Breast 0.14 0.02 0.04 174
Pancreas 0.00 0.00 0.00 82
Ovary 0.00 0.00 0.00 48
Esophagus 0.00 0.00 0.00 41
Liver 0.00 0.00 0.00 38
Stomach 0.00 0.00 0.00 60
Healthy 0.80 0.94 0.87 812
accuracy 0.62 1695
macro avg 0.15 0.20 0.16 1695
weighted avg 0.48 0.62 0.53 1695
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9506 | 0.0235 |
| Sensitivity (med) | 0.2007 | 0.0097 |
| Sensitivity weighted (med) | 0.6224 | 0.0097 |
| AUC (med) | 0.6260 | 0.0218 |
| Specificity (mean) | 0.9421 | 0.0235 |
| Sensitivity (mean) | 0.1981 | 0.0097 |
| Sensitivity weighted (mean) | 0.6201 | 0.0097 |
| AUC (mean) | 0.6250 | 0.0218 |
60.3% <= AUC <= 64.7% ============================================================================ ================================== GradientBoosting ================================== ============================================================================
Cancer samples correctly classified (sensitivity): 688, out of 883 (77.9%) Total precision on cancer samples: 91.9%
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.298713 | 0.198709 | 0.925926 | 1.0 | 0.225720 | 0.857555 | 0.552941 | 0.908197 | 0.552941 | 0.908197 | 0.561332 | 0.976804 | 0.574504 | 0.959907 |
| 1 | 16.306875 | 0.193651 | 0.962963 | 1.0 | 0.217049 | 0.855920 | 0.576471 | 0.907541 | 0.576471 | 0.907541 | 0.523261 | 0.975369 | 0.552521 | 0.958022 |
| 2 | 16.177710 | 0.175108 | 0.950617 | 1.0 | 0.203640 | 0.857336 | 0.564706 | 0.905574 | 0.564706 | 0.905574 | 0.497389 | 0.974141 | 0.535292 | 0.956427 |
| 3 | 16.171883 | 0.197143 | 0.925926 | 1.0 | 0.210981 | 0.855837 | 0.552941 | 0.908197 | 0.552941 | 0.908197 | 0.544879 | 0.975590 | 0.570185 | 0.958464 |
| 4 | 12.967130 | 0.237553 | 0.925926 | 1.0 | 0.197699 | 0.860770 | 0.564706 | 0.907541 | 0.564706 | 0.907541 | 0.559641 | 0.975227 | 0.586395 | 0.957741 |
| 5 | 13.134689 | 0.290546 | 0.901235 | 1.0 | 0.166648 | 0.872906 | 0.514793 | 0.915465 | 0.514793 | 0.915465 | 0.518636 | 0.977046 | 0.531727 | 0.960925 |
| 6 | 13.142718 | 0.356629 | 0.951220 | 1.0 | 0.201624 | 0.854397 | 0.562130 | 0.904325 | 0.562130 | 0.904325 | 0.557049 | 0.974533 | 0.557792 | 0.956825 |
| 7 | 13.036963 | 0.334787 | 0.963415 | 1.0 | 0.178669 | 0.854707 | 0.544379 | 0.906946 | 0.544379 | 0.906946 | 0.484525 | 0.974536 | 0.512436 | 0.956728 |
| 8 | 6.831806 | 0.122207 | 0.864198 | 1.0 | 0.189068 | 0.872568 | 0.508876 | 0.914155 | 0.508876 | 0.914155 | 0.561300 | 0.977212 | 0.578974 | 0.961171 |
| 9 | 6.770181 | 0.105370 | 0.876543 | 1.0 | 0.218527 | 0.856973 | 0.520710 | 0.908912 | 0.520710 | 0.908912 | 0.521960 | 0.975752 | 0.544973 | 0.958328 |
precision recall f1-score support
Colorectum 0.43 0.38 0.40 346
Lung 0.07 0.05 0.06 94
Breast 0.15 0.09 0.11 174
Pancreas 0.11 0.11 0.11 82
Ovary 0.20 0.19 0.20 48
Esophagus 0.00 0.00 0.00 41
Liver 0.00 0.00 0.00 38
Stomach 0.08 0.07 0.07 60
Healthy 0.79 0.92 0.85 812
accuracy 0.55 1695
macro avg 0.20 0.20 0.20 1695
weighted avg 0.50 0.55 0.52 1695
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9259 | 0.0330 |
| Sensitivity (med) | 0.2026 | 0.0176 |
| Sensitivity weighted (med) | 0.5529 | 0.0176 |
| AUC (med) | 0.5341 | 0.0266 |
| Specificity (mean) | 0.9248 | 0.0330 |
| Sensitivity (mean) | 0.2010 | 0.0176 |
| Sensitivity weighted (mean) | 0.5463 | 0.0176 |
| AUC (mean) | 0.5330 | 0.0266 |
50.6% <= AUC <= 56.0% ============================================================================ ================================== CatBoost ================================== ============================================================================
Cancer samples correctly classified (sensitivity): 685, out of 883 (77.6%) Total precision on cancer samples: 94.0%
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.559939 | 0.325447 | 0.950617 | 0.960328 | 0.222908 | 0.219719 | 0.629412 | 0.636721 | 0.629412 | 0.636721 | 0.622762 | 0.745183 | 0.724162 | 0.812642 |
| 1 | 2.960425 | 0.230341 | 0.975309 | 0.948016 | 0.232001 | 0.224443 | 0.652941 | 0.635410 | 0.652941 | 0.635410 | 0.647190 | 0.736590 | 0.761680 | 0.805952 |
| 2 | 3.145089 | 0.355710 | 0.950617 | 0.954856 | 0.194513 | 0.223498 | 0.617647 | 0.637377 | 0.617647 | 0.637377 | 0.578572 | 0.738788 | 0.725130 | 0.806361 |
| 3 | 3.174979 | 0.466425 | 0.950617 | 0.952120 | 0.175465 | 0.221963 | 0.582353 | 0.637377 | 0.582353 | 0.637377 | 0.629612 | 0.742072 | 0.730283 | 0.809519 |
| 4 | 2.833734 | 0.242386 | 0.938272 | 0.952120 | 0.196316 | 0.220584 | 0.617647 | 0.633443 | 0.617647 | 0.633443 | 0.635422 | 0.741267 | 0.749269 | 0.807423 |
| 5 | 2.355238 | 0.198222 | 0.913580 | 0.956224 | 0.177699 | 0.240881 | 0.579882 | 0.644168 | 0.579882 | 0.644168 | 0.614070 | 0.737905 | 0.711246 | 0.810789 |
| 6 | 2.169795 | 0.206303 | 0.963415 | 0.958904 | 0.220772 | 0.213327 | 0.639053 | 0.632372 | 0.639053 | 0.632372 | 0.662250 | 0.734964 | 0.775007 | 0.803472 |
| 7 | 2.081398 | 0.157535 | 0.975610 | 0.953425 | 0.225395 | 0.228960 | 0.644970 | 0.638925 | 0.644970 | 0.638925 | 0.604216 | 0.742516 | 0.751905 | 0.807261 |
| 8 | 0.827216 | 0.089083 | 0.925926 | 0.948016 | 0.191116 | 0.225291 | 0.603550 | 0.637615 | 0.603550 | 0.637615 | 0.648816 | 0.736136 | 0.759411 | 0.807871 |
| 9 | 0.852788 | 0.091278 | 0.913580 | 0.946648 | 0.246897 | 0.218439 | 0.627219 | 0.635649 | 0.627219 | 0.635649 | 0.647715 | 0.741339 | 0.759961 | 0.809205 |
precision recall f1-score support
Colorectum 0.40 0.79 0.53 346
Lung 0.00 0.00 0.00 94
Breast 0.11 0.02 0.04 174
Pancreas 0.00 0.00 0.00 82
Ovary 0.75 0.12 0.21 48
Esophagus 0.00 0.00 0.00 41
Liver 0.00 0.00 0.00 38
Stomach 0.00 0.00 0.00 60
Healthy 0.80 0.95 0.86 812
accuracy 0.62 1695
macro avg 0.23 0.21 0.18 1695
weighted avg 0.50 0.62 0.53 1695
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9506 | 0.0216 |
| Sensitivity (med) | 0.2085 | 0.0231 |
| Sensitivity weighted (med) | 0.6224 | 0.0231 |
| AUC (med) | 0.6325 | 0.0237 |
| Specificity (mean) | 0.9458 | 0.0216 |
| Sensitivity (mean) | 0.2083 | 0.0231 |
| Sensitivity weighted (mean) | 0.6195 | 0.0231 |
| AUC (mean) | 0.6291 | 0.0237 |
60.5% <= AUC <= 65.3% ============================================================================ ================================== LightGBM ================================== ============================================================================
Cancer samples correctly classified (sensitivity): 694, out of 883 (78.6%) Total precision on cancer samples: 93.3%
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.400482 | 0.620472 | 0.938272 | 0.967168 | 0.209852 | 0.364414 | 0.576471 | 0.685246 | 0.576471 | 0.685246 | 0.583124 | 0.846637 | 0.682571 | 0.868550 |
| 1 | 1.394569 | 0.603899 | 0.962963 | 0.964432 | 0.236978 | 0.356640 | 0.623529 | 0.685246 | 0.623529 | 0.685246 | 0.649786 | 0.844133 | 0.742357 | 0.865747 |
| 2 | 1.410660 | 0.591104 | 0.950617 | 0.964432 | 0.238428 | 0.338393 | 0.635294 | 0.681967 | 0.635294 | 0.681967 | 0.585559 | 0.845956 | 0.694963 | 0.866967 |
| 3 | 1.371865 | 0.584621 | 0.925926 | 0.964432 | 0.160024 | 0.364912 | 0.547059 | 0.692459 | 0.547059 | 0.692459 | 0.599085 | 0.847731 | 0.702768 | 0.869103 |
| 4 | 1.406576 | 0.552193 | 0.950617 | 0.963064 | 0.214938 | 0.369907 | 0.629412 | 0.684590 | 0.629412 | 0.684590 | 0.588403 | 0.843186 | 0.696304 | 0.864529 |
| 5 | 1.366962 | 0.518938 | 0.901235 | 0.957592 | 0.176514 | 0.384243 | 0.568047 | 0.692005 | 0.568047 | 0.692005 | 0.608645 | 0.848293 | 0.696689 | 0.870538 |
| 6 | 1.362691 | 0.511814 | 0.951220 | 0.957534 | 0.257252 | 0.353057 | 0.633136 | 0.682176 | 0.633136 | 0.682176 | 0.678910 | 0.837862 | 0.768350 | 0.861856 |
| 7 | 1.427857 | 0.525134 | 0.963415 | 0.961644 | 0.220118 | 0.368702 | 0.603550 | 0.684797 | 0.603550 | 0.684797 | 0.579170 | 0.845068 | 0.692805 | 0.864362 |
| 8 | 1.194770 | 0.395210 | 0.938272 | 0.960328 | 0.195846 | 0.351910 | 0.585799 | 0.683486 | 0.585799 | 0.683486 | 0.666800 | 0.846574 | 0.744295 | 0.869628 |
| 9 | 1.110041 | 0.390055 | 0.901235 | 0.958960 | 0.245163 | 0.366979 | 0.615385 | 0.684142 | 0.615385 | 0.684142 | 0.590531 | 0.845405 | 0.708594 | 0.866513 |
precision recall f1-score support
Colorectum 0.43 0.67 0.52 346
Lung 0.00 0.00 0.00 94
Breast 0.21 0.07 0.10 174
Pancreas 0.11 0.05 0.07 82
Ovary 0.32 0.21 0.25 48
Esophagus 0.00 0.00 0.00 41
Liver 0.00 0.00 0.00 38
Stomach 0.05 0.02 0.02 60
Healthy 0.80 0.94 0.86 812
accuracy 0.60 1695
macro avg 0.21 0.22 0.20 1695
weighted avg 0.51 0.60 0.54 1695
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9444 | 0.0215 |
| Sensitivity (med) | 0.2175 | 0.0294 |
| Sensitivity weighted (med) | 0.6095 | 0.0294 |
| AUC (med) | 0.5948 | 0.0356 |
| Specificity (mean) | 0.9384 | 0.0215 |
| Sensitivity (mean) | 0.2155 | 0.0294 |
| Sensitivity weighted (mean) | 0.6018 | 0.0294 |
| AUC (mean) | 0.6130 | 0.0356 |
57.7% <= AUC <= 64.9% ============================================================================ ================================== XGBoost ================================== ============================================================================
Cancer samples correctly classified (sensitivity): 680, out of 883 (77.0%) Total precision on cancer samples: 93.3%
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.221928 | 0.213279 | 0.938272 | 0.979480 | 0.219376 | 0.384668 | 0.594118 | 0.713443 | 0.594118 | 0.713443 | 0.609096 | 0.856536 | 0.706989 | 0.878074 |
| 1 | 1.226665 | 0.186831 | 0.975309 | 0.972640 | 0.212953 | 0.363743 | 0.617647 | 0.704262 | 0.617647 | 0.704262 | 0.630803 | 0.847828 | 0.742506 | 0.872587 |
| 2 | 1.237409 | 0.183150 | 0.950617 | 0.972640 | 0.224275 | 0.355328 | 0.617647 | 0.702295 | 0.617647 | 0.702295 | 0.578831 | 0.848925 | 0.721066 | 0.871609 |
| 3 | 1.207728 | 0.285492 | 0.938272 | 0.978112 | 0.172110 | 0.387046 | 0.552941 | 0.718689 | 0.552941 | 0.718689 | 0.612325 | 0.850059 | 0.722830 | 0.874602 |
| 4 | 1.179429 | 0.255685 | 0.938272 | 0.974008 | 0.216927 | 0.382874 | 0.623529 | 0.710820 | 0.623529 | 0.710820 | 0.615814 | 0.849017 | 0.739100 | 0.872431 |
| 5 | 1.264557 | 0.350313 | 0.901235 | 0.972640 | 0.157280 | 0.384675 | 0.538462 | 0.711009 | 0.538462 | 0.711009 | 0.591765 | 0.854301 | 0.688021 | 0.878587 |
| 6 | 1.242662 | 0.353087 | 0.951220 | 0.973973 | 0.228494 | 0.359301 | 0.621302 | 0.702490 | 0.621302 | 0.702490 | 0.682998 | 0.845615 | 0.775974 | 0.870038 |
| 7 | 1.289602 | 0.541571 | 0.963415 | 0.972603 | 0.189562 | 0.381321 | 0.597633 | 0.708388 | 0.597633 | 0.708388 | 0.597017 | 0.848876 | 0.747365 | 0.871600 |
| 8 | 1.926997 | 0.441700 | 0.925926 | 0.972640 | 0.201011 | 0.364524 | 0.591716 | 0.707077 | 0.591716 | 0.707077 | 0.667002 | 0.853561 | 0.760233 | 0.877501 |
| 9 | 1.677904 | 0.382762 | 0.913580 | 0.975376 | 0.242903 | 0.370001 | 0.609467 | 0.710354 | 0.609467 | 0.710354 | 0.609315 | 0.849417 | 0.728393 | 0.873886 |
precision recall f1-score support
Colorectum 0.42 0.64 0.51 346
Lung 0.00 0.00 0.00 94
Breast 0.16 0.07 0.10 174
Pancreas 0.19 0.09 0.12 82
Ovary 0.35 0.12 0.18 48
Esophagus 0.00 0.00 0.00 41
Liver 0.00 0.00 0.00 38
Stomach 0.00 0.00 0.00 60
Healthy 0.79 0.94 0.86 812
accuracy 0.60 1695
macro avg 0.21 0.21 0.20 1695
weighted avg 0.50 0.60 0.54 1695
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9383 | 0.0211 |
| Sensitivity (med) | 0.2149 | 0.0252 |
| Sensitivity weighted (med) | 0.6036 | 0.0252 |
| AUC (med) | 0.6108 | 0.0310 |
| Specificity (mean) | 0.9396 | 0.0211 |
| Sensitivity (mean) | 0.2065 | 0.0252 |
| Sensitivity weighted (mean) | 0.5964 | 0.0252 |
| AUC (mean) | 0.6195 | 0.0310 |
58.9% <= AUC <= 65.1%
Plot SHAP values for the classifiers supported by shap's TreeExplainer.
# Select predictors and target variable
Xa = sh_data[aneuploidy]
Ya = sh_data['Tumor type']
# Split into train and test sets
trainXa, testXa, trainYa, testYa = train_test_split(Xa, Ya, test_size=0.2, random_state=89, stratify=Ya)
# Fit the transformers on the train set, then transform the test set as the pipeline would
pt = PercentileTransformer()
sc = StandardScaler()
pt.fit(trainXa, trainYa)
sc.fit(trainXa)
testXa = pt.transform(testXa)
testXa = sc.transform(testXa)
# Create dataframe for visualisation purposes
testXa = pd.DataFrame(testXa, columns=aneuploidy)
for name in names[5:]:
    print('\n\n============================================================================')
    print(f'================================== {name} ==================================')
    print('============================================================================')
    # For multiclass models TreeExplainer returns one array of SHAP values
    # per class; [1] selects a single class for the summary plot
    shap_values = shap.TreeExplainer(best_models_A[name],
                                     feature_perturbation="tree_path_dependent").shap_values(testXa)[1]
    shap.summary_plot(shap_values, testXa)
============================================================================ ================================== CatBoost ================================== ============================================================================
Setting feature_perturbation = "tree_path_dependent" because no background data was given.
============================================================================ ================================== LightGBM ================================== ============================================================================
============================================================================ ================================== XGBoost ================================== ============================================================================
The results for the individual models, and more specifically the specificity, are considerably lower than with all ten features. At most, some 78% of the cancer samples are correctly classified, with precision hovering around 94%. This is achieved with KNN, Random Forest, CatBoost and LightGBM.
Aneuploidy
Try a VotingClassifier with the above estimators. Use voting="soft", as that often works better for well-calibrated models, and give more weight to the more performant models.
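As a minimal illustration of weighted soft voting (toy data and hypothetical weights, not the notebook's pipeline): each base estimator's predicted class probabilities are averaged with the given weights, and the argmax of the weighted average is the ensemble's prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

clf = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('dt', DecisionTreeClassifier(random_state=0))],
    voting='soft', weights=[1, 2])  # hypothetical weights
clf.fit(X, y)

# Soft voting is the weighted average of the per-estimator probabilities
probas = np.array([est.predict_proba(X) for est in clf.estimators_])
manual = np.average(probas, axis=0, weights=[1, 2])
assert np.allclose(manual, clf.predict_proba(X))
```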
%%time
# Select predictors and target variable
Xa = sh_data[aneuploidy]
Ya = sh_data['Tumor type']
# Split into train and test sets
trainXa, testXa, trainYa, testYa = train_test_split(Xa, Ya, test_size=0.2, random_state=89, stratify=Ya)
# Create Voting Classifier with the eight estimators tuned above
vtclf_A = ensemble.VotingClassifier(estimators=list(best_models.items()),
                                    voting='soft', weights=[1, 3, 1, 4, 2, 4, 4, 3], n_jobs=-1)
# Create pipeline
vtclf_pipeline_A = Pipeline(steps=[('aneuploidy_pipeline', aneuploidy_pipeline),
('vtclf', vtclf_A)], verbose=3)
vtclf_pipeline_A.fit(trainXa, trainYa)
# Cross-validate the voting pipeline
cvScores_votingA = crossVal(vtclf_pipeline_A, Xa, Ya, cv_folds=10)
[Pipeline] (step 1 of 2) Processing aneuploidy_pipeline, total= 0.0s [Pipeline] ............. (step 2 of 2) Processing vtclf, total= 15.1s
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak. [Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 1.0min remaining: 26.1s [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 1.3min finished
Model report Cross Validated scores Specificity (test): 0.9445 Sensitivity weighted (test): 0.6077 Sensitivity (test): 0.2103 AUC (train): 0.88 AUC (test): 0.6267 CPU times: user 1.13 s, sys: 442 ms, total: 1.57 s Wall time: 1min 32s
pd.DataFrame(cvScores_votingA)
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20.688713 | 0.694068 | 0.950617 | 0.983584 | 0.214575 | 0.375923 | 0.594118 | 0.720656 | 0.594118 | 0.720656 | 0.606580 | 0.886552 | 0.714735 | 0.895302 |
| 1 | 18.180848 | 0.723357 | 0.975309 | 0.982216 | 0.219302 | 0.365250 | 0.629412 | 0.721311 | 0.629412 | 0.721311 | 0.639954 | 0.879155 | 0.752883 | 0.890212 |
| 2 | 22.312046 | 0.741180 | 0.950617 | 0.976744 | 0.210386 | 0.336182 | 0.611765 | 0.706230 | 0.611765 | 0.706230 | 0.597867 | 0.883735 | 0.731550 | 0.892873 |
| 3 | 22.663104 | 0.681223 | 0.938272 | 0.989056 | 0.164757 | 0.388623 | 0.552941 | 0.733115 | 0.552941 | 0.733115 | 0.636960 | 0.882058 | 0.738120 | 0.892464 |
| 4 | 22.840541 | 0.884284 | 0.938272 | 0.982216 | 0.223277 | 0.396649 | 0.635294 | 0.728525 | 0.635294 | 0.728525 | 0.616181 | 0.879745 | 0.735498 | 0.889991 |
| 5 | 22.395067 | 0.900206 | 0.925926 | 0.983584 | 0.172909 | 0.381049 | 0.568047 | 0.724771 | 0.568047 | 0.724771 | 0.616069 | 0.886377 | 0.709154 | 0.896940 |
| 6 | 21.761761 | 0.853826 | 0.951220 | 0.984932 | 0.235030 | 0.364609 | 0.639053 | 0.721494 | 0.639053 | 0.721494 | 0.667884 | 0.877966 | 0.770819 | 0.888637 |
| 7 | 21.189285 | 0.865835 | 0.975610 | 0.980822 | 0.215591 | 0.367996 | 0.633136 | 0.717562 | 0.633136 | 0.717562 | 0.598975 | 0.877215 | 0.737343 | 0.888634 |
| 8 | 12.694949 | 0.390240 | 0.925926 | 0.982216 | 0.201011 | 0.354851 | 0.591716 | 0.714286 | 0.591716 | 0.714286 | 0.657679 | 0.892137 | 0.760183 | 0.900553 |
| 9 | 12.488997 | 0.392995 | 0.913580 | 0.982216 | 0.246534 | 0.367250 | 0.621302 | 0.720183 | 0.621302 | 0.720183 | 0.629223 | 0.881386 | 0.747966 | 0.891814 |
# Make predictions with the Voting Classifier on the entire dataset
preds_voting_A = cross_val_predict(vtclf_pipeline_A, Xa, Ya, cv=10, verbose=3, n_jobs=-1)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 45.2s remaining: 19.4s [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 56.5s finished
# Plot Confusion Matrix
plot_confusion_matrix(Ya, preds_voting_A, target_names=cancers[:9],
                      title='Confusion Matrix')
# Plot sensitivities
plot_sensitivities(preds_voting_A, Ya, title='Sensitivity per Cancer Type')
# True positives: predicted as cancer (label != 9) and actually cancer
cSamples = sum((preds_voting_A != 9) & (Ya != 9))
# All samples predicted as cancer
ccSamples = sum(preds_voting_A != 9)
# All actual cancer samples
tot = sum(Ya != 9)
print(f'Cancer samples correctly classified (sensitivity): {cSamples}, out of {tot} ({cSamples/tot:.1%})')
print(f'Total precision on cancer samples: {cSamples/ccSamples:.1%}')
Cancer samples correctly classified (sensitivity): 674, out of 883 (76.3%) Total precision on cancer samples: 93.7%
print(classification_report(Ya, preds_voting_A, target_names=cancers[:9]))
precision recall f1-score support
Colorectum 0.42 0.70 0.53 346
Lung 0.00 0.00 0.00 94
Breast 0.17 0.07 0.10 174
Pancreas 0.17 0.05 0.08 82
Ovary 0.39 0.15 0.21 48
Esophagus 0.00 0.00 0.00 41
Liver 0.00 0.00 0.00 38
Stomach 0.00 0.00 0.00 60
Healthy 0.79 0.94 0.86 812
accuracy 0.61 1695
macro avg 0.22 0.21 0.20 1695
weighted avg 0.50 0.61 0.54 1695
# Print performance
performance_votingA = cv_score_summary(cvScores_votingA)
display(performance_votingA)
# Calculate AUC with standard deviations
med = performance_votingA.loc['AUC (mean)', 'Scores']
std = performance_votingA.loc['AUC (mean)', 'Std']
print("{:.1%} <= AUC <= {:.1%}".format(med-std, med+std))
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9444 | 0.0194 |
| Sensitivity (med) | 0.2151 | 0.0240 |
| Sensitivity weighted (med) | 0.6165 | 0.0240 |
| AUC (med) | 0.6227 | 0.0227 |
| Specificity (mean) | 0.9445 | 0.0194 |
| Sensitivity (mean) | 0.2103 | 0.0240 |
| Sensitivity weighted (mean) | 0.6077 | 0.0240 |
| AUC (mean) | 0.6267 | 0.0227 |
60.4% <= AUC <= 64.9%
Combining several estimators by voting has not improved performance.
Run several experiments stacking the previously trained models into a larger and, hopefully, more powerful model. Start with Logistic Regression as the final estimator in the StackingClassifier.
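A minimal sketch of what `passthrough` does in a StackingClassifier (toy data and estimators, not the notebook's tuned models): with `passthrough=True` the final estimator is trained on the base estimators' cross-validated predictions concatenated with the original features, rather than on the predictions alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

estimators = [('knn', KNeighborsClassifier()),
              ('rf', RandomForestClassifier(n_estimators=50, random_state=0))]

stack = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression(max_iter=1000),
                           passthrough=True, cv=5)
stack.fit(X, y)

# For a binary target each base estimator contributes one probability column,
# so the meta-estimator sees 2 stacked columns + the 10 original features
print(stack.transform(X).shape)  # -> (300, 12)
```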
# Keep only the better performing models for stacking; drop Random Forest, Logistic Regression and SVC
to_stack = copy.deepcopy(best_models)
del to_stack['RandomForest']
del to_stack['LogisticRegression']
del to_stack['SVC']
to_stack
Finished loading model, total used 400 iterations
{'KNN': KNeighborsClassifier(algorithm='auto', leaf_size=2, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=8, p=1,
weights='uniform'),
'GradientBoosting': GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=600,
n_iter_no_change=None, presort='deprecated',
random_state=None, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False),
'CatBoost': <catboost.core.CatBoostClassifier at 0x1c24fe1550>,
'LightGBM': LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=3,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=400, n_jobs=-1, num_leaves=8,
objective='multiclass', random_state=None, reg_alpha=0.0,
reg_lambda=0.0, silent=True, subsample=1.0,
subsample_for_bin=200000, subsample_freq=0),
'XGBoost': XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.5, eval_metric='merror',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints=None, learning_rate=0.1, max_delta_step=0,
max_depth=4, min_child_weight=1, missing=nan,
monotone_constraints=None, n_estimators=200, n_jobs=0,
num_parallel_tree=1, objective='multi:softprob', random_state=29,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=29,
subsample=0.8, tree_method=None, validate_parameters=False,
verbosity=None)}
%%time
# Select predictors and target variable
Xa = sh_data[aneuploidy]
Ya = sh_data['Tumor type']
# Split into train and test sets
trainXa, testXa, trainYa, testYa = train_test_split(Xa, Ya, test_size=0.2, random_state=89, stratify=Ya)
# Create a final meta estimator for the stacking classifier
meta_estimator = LogisticRegression(max_iter=1000)
# Create Stacking Classifier with four estimators
stclf_A = ensemble.StackingClassifier(estimators=list(to_stack.items()),
final_estimator=meta_estimator, passthrough=True,
cv=10, n_jobs=-1)
# Create pipeline
stclf_pipeline_A = Pipeline(steps=[('aneuploidy_pipeline', aneuploidy_pipeline),
('stacking_clf', stclf_A)], verbose=3)
stclf_pipeline_A.fit(trainXa, trainYa)
# Cross validate above parameter tuning
cvScores_stacking_A = crossVal(stclf_pipeline_A, Xa, Ya, cv_folds=10)
[Pipeline] (step 1 of 2) Processing aneuploidy_pipeline, total= 0.0s
[Pipeline] ...... (step 2 of 2) Processing stacking_clf, total= 1.6min
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 7.6min remaining: 3.3min
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 9.7min finished
Model report
Cross Validated scores
Sensitivity weighted (test): 0.6283
Sensitivity (test): 0.2115
AUC (train): 0.60
AUC (test): 0.6248
CPU times: user 959 ms, sys: 303 ms, total: 1.26 s
Wall time: 11min 19s
pd.DataFrame(cvScores_stacking_A)
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 226.267696 | 0.531158 | 0.938272 | 0.963064 | 0.233882 | 0.226892 | 0.629412 | 0.636721 | 0.629412 | 0.636721 | 0.612743 | 0.595058 | 0.725804 | 0.737327 |
| 1 | 226.051529 | 0.593446 | 0.987654 | 0.949384 | 0.224025 | 0.218499 | 0.647059 | 0.620984 | 0.647059 | 0.620984 | 0.624578 | 0.594999 | 0.749628 | 0.723641 |
| 2 | 226.138777 | 0.556594 | 0.950617 | 0.952120 | 0.197688 | 0.209184 | 0.623529 | 0.621639 | 0.623529 | 0.621639 | 0.587806 | 0.574800 | 0.724574 | 0.716180 |
| 3 | 225.908781 | 0.641653 | 0.962963 | 0.953488 | 0.180012 | 0.221375 | 0.594118 | 0.632131 | 0.594118 | 0.632131 | 0.586795 | 0.617789 | 0.707251 | 0.735982 |
| 4 | 222.168645 | 0.756659 | 0.950617 | 0.953488 | 0.194513 | 0.235960 | 0.617647 | 0.638689 | 0.617647 | 0.638689 | 0.641303 | 0.600920 | 0.753306 | 0.738035 |
| 5 | 221.943990 | 0.752993 | 0.925926 | 0.953488 | 0.198305 | 0.228422 | 0.615385 | 0.633683 | 0.615385 | 0.633683 | 0.624736 | 0.562318 | 0.719477 | 0.723761 |
| 6 | 221.485871 | 0.711498 | 0.951220 | 0.958904 | 0.225952 | 0.214372 | 0.639053 | 0.627785 | 0.639053 | 0.627785 | 0.638201 | 0.606771 | 0.763791 | 0.736777 |
| 7 | 221.391182 | 0.719323 | 0.975610 | 0.950685 | 0.209708 | 0.232975 | 0.644970 | 0.631717 | 0.644970 | 0.631717 | 0.603477 | 0.624707 | 0.749839 | 0.738574 |
| 8 | 119.709262 | 0.547964 | 0.962963 | 0.957592 | 0.198499 | 0.225900 | 0.627219 | 0.636304 | 0.627219 | 0.636304 | 0.645021 | 0.606589 | 0.760965 | 0.741288 |
| 9 | 119.460064 | 0.572272 | 0.938272 | 0.953488 | 0.252909 | 0.207803 | 0.644970 | 0.627785 | 0.644970 | 0.627785 | 0.683231 | 0.617106 | 0.781826 | 0.748092 |
# Make predictions with the Stacking Classifier on the entire dataset
preds_stacking_A = cross_val_predict(stclf_pipeline_A, Xa, Ya, cv=10, verbose=3, n_jobs=-1)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 8.5min remaining: 3.7min
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 11.2min finished
# Plot Confusion Matrix
plot_confusion_matrix(Ya, preds_stacking_A, target_names=list(cancers[:9]),
title='Confusion Matrix')
# Plot sensitivities
plot_sensitivities(preds_stacking_A, Ya, title='Sensitivity per Cancer Type')
cSamples = sum((preds_stacking_A != 9) & (Ya != 9))
ccSamples = sum(preds_stacking_A != 9)
tot = sum(Ya != 9)
print(f'Cancer samples correctly classified (sensitivity): {cSamples}, out of {tot} ({cSamples/tot:.1%})')
print(f'Total precision on cancer samples: {cSamples/ccSamples:.1%}')
Cancer samples correctly classified (sensitivity): 668, out of 883 (75.7%) Total precision on cancer samples: 94.8%
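The two numbers above come from collapsing the multiclass predictions to a binary cancer-vs-healthy decision: a prediction of any cancer type counts as a cancer call, even when the type is wrong. A self-contained sketch of that logic, with Healthy encoded as 9 as in the notebook and made-up labels:

```python
import numpy as np

# Made-up labels: 9 = Healthy, any other value = a cancer type
y_true = np.array([0, 3, 9, 9, 5, 9, 2, 9])
y_pred = np.array([0, 9, 9, 1, 5, 9, 7, 9])

correct_cancer = np.sum((y_pred != 9) & (y_true != 9))  # called cancer, truly cancer
called_cancer = np.sum(y_pred != 9)                     # all cancer calls
total_cancer = np.sum(y_true != 9)                      # truly cancer samples

sensitivity = correct_cancer / total_cancer
precision = correct_cancer / called_cancer
print(f'sensitivity={sensitivity:.2f}, precision={precision:.2f}')
```

Note that index 6 (true Stomach-like class 2, predicted Liver-like class 7) still counts toward both sensitivity and precision, which is why these binary figures are much higher than the per-type recalls in the classification report.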
print(classification_report(Ya, preds_stacking_A, target_names=cancers[:9]))
precision recall f1-score support
Colorectum 0.42 0.80 0.55 346
Lung 0.33 0.01 0.02 94
Breast 0.21 0.03 0.06 174
Pancreas 0.00 0.00 0.00 82
Ovary 0.45 0.10 0.17 48
Esophagus 0.00 0.00 0.00 41
Liver 0.00 0.00 0.00 38
Stomach 0.00 0.00 0.00 60
Healthy 0.78 0.95 0.86 812
accuracy 0.63 1695
macro avg 0.24 0.21 0.18 1695
weighted avg 0.51 0.63 0.54 1695
Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
# Print performance
performance_stacking_A = cv_score_summary(cvScores_stacking_A)
display(performance_stacking_A)
# Report the cross-validated AUC as mean +/- one standard deviation
auc_mean = performance_stacking_A.loc['AUC (mean)', 'Scores']
auc_std = performance_stacking_A.loc['AUC (mean)', 'Std']
print("{:.1%} <= AUC <= {:.1%}".format(auc_mean - auc_std, auc_mean + auc_std))
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9509 | 0.0175 |
| Sensitivity (med) | 0.2041 | 0.0210 |
| Sensitivity weighted (med) | 0.6283 | 0.0210 |
| AUC (med) | 0.6247 | 0.0277 |
| Specificity (mean) | 0.9544 | 0.0175 |
| Sensitivity (mean) | 0.2115 | 0.0210 |
| Sensitivity weighted (mean) | 0.6283 | 0.0210 |
| AUC (mean) | 0.6248 | 0.0277 |
59.7% <= AUC <= 65.2%
Use Random Forest as the final estimator. Set n_estimators=300 and max_depth=4, values that performed well in earlier experiments, to avoid excessive hyperparameter tuning.
%%time
# Select predictors and target variable
Xa = sh_data[aneuploidy]
Ya = sh_data['Tumor type']
# Split into train and test sets
trainXa, testXa, trainYa, testYa = train_test_split(Xa, Ya, test_size=0.2, random_state=89, stratify=Ya)
# Create a final meta estimator for the stacking classifier
meta_estimator = ensemble.RandomForestClassifier(n_estimators=300,
max_depth=4)
# Create Stacking Classifier with four estimators
stclf2_A = ensemble.StackingClassifier(estimators=list(to_stack.items()),
final_estimator=meta_estimator, passthrough=True,
cv=10, n_jobs=-1)
# Create pipeline
stclf_pipeline2_A = Pipeline(steps=[('aneuploidy_pipeline', aneuploidy_pipeline),
('stacking_clf', stclf2_A)], verbose=3)
stclf_pipeline2_A.fit(trainXa, trainYa)
# Cross validate above parameter tuning
cvScores_stacking2_A = crossVal(stclf_pipeline2_A, Xa, Ya, cv_folds=10)
[Pipeline] (step 1 of 2) Processing aneuploidy_pipeline, total= 0.0s
[Pipeline] ...... (step 2 of 2) Processing stacking_clf, total= 1.9min
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 9.0min remaining: 3.9min
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 11.5min finished
Model report
Cross Validated scores
Sensitivity weighted (test): 0.6254
Sensitivity (test): 0.2053
AUC (train): 0.68
AUC (test): 0.6194
CPU times: user 1.37 s, sys: 331 ms, total: 1.7 s
Wall time: 13min 25s
pd.DataFrame(cvScores_stacking2_A)
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 255.930240 | 1.354395 | 0.950617 | 0.956224 | 0.226259 | 0.213844 | 0.641176 | 0.635410 | 0.641176 | 0.635410 | 0.621891 | 0.695033 | 0.724234 | 0.787207 |
| 1 | 256.697202 | 0.782959 | 0.975309 | 0.942544 | 0.225828 | 0.210812 | 0.647059 | 0.630164 | 0.647059 | 0.630164 | 0.615738 | 0.678184 | 0.743032 | 0.768908 |
| 2 | 256.023885 | 1.297548 | 0.950617 | 0.946648 | 0.204037 | 0.216793 | 0.635294 | 0.634098 | 0.635294 | 0.634098 | 0.586627 | 0.671111 | 0.725206 | 0.772170 |
| 3 | 258.316789 | 1.553517 | 0.938272 | 0.950752 | 0.183617 | 0.218657 | 0.594118 | 0.636721 | 0.594118 | 0.636721 | 0.569105 | 0.685510 | 0.699313 | 0.775893 |
| 4 | 269.129533 | 0.624074 | 0.925926 | 0.954856 | 0.201293 | 0.215923 | 0.623529 | 0.635410 | 0.623529 | 0.635410 | 0.600031 | 0.675117 | 0.733827 | 0.770199 |
| 5 | 268.387785 | 0.629265 | 0.913580 | 0.948016 | 0.190398 | 0.216950 | 0.603550 | 0.634993 | 0.603550 | 0.634993 | 0.596129 | 0.688067 | 0.702400 | 0.788349 |
| 6 | 256.016202 | 0.819065 | 0.951220 | 0.945205 | 0.222685 | 0.212581 | 0.639053 | 0.633683 | 0.639053 | 0.633683 | 0.656664 | 0.666665 | 0.775086 | 0.767751 |
| 7 | 258.912715 | 0.943983 | 0.975610 | 0.946575 | 0.203172 | 0.213537 | 0.644970 | 0.631717 | 0.644970 | 0.631717 | 0.625659 | 0.689163 | 0.760367 | 0.780547 |
| 8 | 150.352724 | 0.346798 | 0.913580 | 0.949384 | 0.196280 | 0.216789 | 0.609467 | 0.633683 | 0.609467 | 0.633683 | 0.667713 | 0.662796 | 0.757638 | 0.769145 |
| 9 | 150.490368 | 0.453688 | 0.913580 | 0.945280 | 0.199548 | 0.201185 | 0.615385 | 0.629751 | 0.615385 | 0.629751 | 0.654494 | 0.682389 | 0.768680 | 0.787693 |
# Make predictions with the Stacking Classifier on the entire dataset
preds_stacking2_A = cross_val_predict(stclf_pipeline2_A, Xa, Ya, cv=10, verbose=3, n_jobs=-1)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 9.6min remaining: 4.1min
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 11.8min finished
# Plot Confusion Matrix
plot_confusion_matrix(Ya, preds_stacking2_A, target_names=list(cancers[:9]),
title='Confusion Matrix')
# Plot sensitivities
plot_sensitivities(preds_stacking2_A, Ya, title='Sensitivity per Cancer Type')
cSamples = sum((preds_stacking2_A != 9) & (Ya != 9))
ccSamples = sum(preds_stacking2_A != 9)
tot = sum(Ya != 9)
print(f'Cancer samples correctly classified (sensitivity): {cSamples}, out of {tot} ({cSamples/tot:.1%})')
print(f'Total precision on cancer samples: {cSamples/ccSamples:.1%}')
Cancer samples correctly classified (sensitivity): 703, out of 883 (79.6%) Total precision on cancer samples: 93.7%
print(classification_report(Ya, preds_stacking2_A, target_names=cancers[:9]))
precision recall f1-score support
Colorectum 0.40 0.85 0.54 346
Lung 0.00 0.00 0.00 94
Breast 0.25 0.01 0.01 174
Pancreas 0.00 0.00 0.00 82
Ovary 1.00 0.06 0.12 48
Esophagus 0.00 0.00 0.00 41
Liver 0.00 0.00 0.00 38
Stomach 0.00 0.00 0.00 60
Healthy 0.81 0.94 0.87 812
accuracy 0.63 1695
macro avg 0.27 0.21 0.17 1695
weighted avg 0.52 0.63 0.53 1695
Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
# Print performance
performance_stacking2_A = cv_score_summary(cvScores_stacking2_A)
display(performance_stacking2_A)
# Report the cross-validated AUC as mean +/- one standard deviation
auc_mean = performance_stacking2_A.loc['AUC (mean)', 'Scores']
auc_std = performance_stacking2_A.loc['AUC (mean)', 'Std']
print("{:.1%} <= AUC <= {:.1%}".format(auc_mean - auc_std, auc_mean + auc_std))
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9444 | 0.0227 |
| Sensitivity (med) | 0.2022 | 0.0141 |
| Sensitivity weighted (med) | 0.6294 | 0.0141 |
| AUC (med) | 0.6188 | 0.0309 |
| Specificity (mean) | 0.9408 | 0.0227 |
| Sensitivity (mean) | 0.2053 | 0.0141 |
| Sensitivity weighted (mean) | 0.6254 | 0.0141 |
| AUC (mean) | 0.6194 | 0.0309 |
58.8% <= AUC <= 65.0%
Use XGBoost as the final estimator and train it on both the predictions from the four base estimators and the original features by setting passthrough=True.
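What `passthrough=True` changes can be seen from the shape of the meta-feature matrix the final estimator receives. A self-contained sketch on synthetic binary data with hypothetical base models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
base = [('rf', RandomForestClassifier(n_estimators=20, random_state=0)),
        ('dt', DecisionTreeClassifier(random_state=0))]

# Without passthrough the final estimator sees only the base models' predictions...
plain = StackingClassifier(estimators=base, final_estimator=LogisticRegression(),
                           passthrough=False, cv=5).fit(X, y)
# ...with passthrough the 5 original feature columns are appended to them
augmented = StackingClassifier(estimators=base, final_estimator=LogisticRegression(),
                               passthrough=True, cv=5).fit(X, y)

print('without passthrough:', plain.transform(X).shape)
print('with passthrough:   ', augmented.transform(X).shape)
```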
%%time
# Select predictors and target variable
Xa = sh_data[aneuploidy]
Ya = sh_data['Tumor type']
# Split into train and test sets
trainXa, testXa, trainYa, testYa = train_test_split(Xa, Ya, test_size=0.2, random_state=89, stratify=Ya)
# Create a final meta estimator for the stacking classifier
meta_estimator = xgb.XGBClassifier(n_estimators=400,
max_depth=4)
# Create Stacking Classifier with four estimators
stclf3_A = ensemble.StackingClassifier(estimators=list(to_stack.items()),
final_estimator=meta_estimator, passthrough=True,
cv=10, n_jobs=-1)
# Create pipeline
stclf_pipeline3_A = Pipeline(steps=[('aneuploidy_pipeline', aneuploidy_pipeline),
('stacking_clf', stclf3_A)], verbose=3)
stclf_pipeline3_A.fit(trainXa, trainYa)
# Cross validate above parameter tuning
cvScores_stacking3_A = crossVal(stclf_pipeline3_A, Xa, Ya, cv_folds=10)
[Pipeline] (step 1 of 2) Processing aneuploidy_pipeline, total= 0.0s
[Pipeline] ...... (step 2 of 2) Processing stacking_clf, total= 2.2min
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 9.0min remaining: 3.9min
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 11.2min finished
Model report
Cross Validated scores
Sensitivity weighted (test): 0.5941
Sensitivity (test): 0.2060
AUC (train): 0.61
AUC (test): 0.6064
CPU times: user 21.6 s, sys: 1.68 s, total: 23.3 s
Wall time: 13min 30s
pd.DataFrame(cvScores_stacking3_A)
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 281.053406 | 0.912597 | 0.950617 | 0.956224 | 0.236621 | 0.220211 | 0.629412 | 0.620984 | 0.629412 | 0.620984 | 0.601236 | 0.625524 | 0.696655 | 0.723217 |
| 1 | 281.174502 | 0.825428 | 0.975309 | 0.946648 | 0.209779 | 0.218503 | 0.611765 | 0.604590 | 0.611765 | 0.604590 | 0.607206 | 0.598562 | 0.700152 | 0.690539 |
| 2 | 275.456384 | 0.819462 | 0.950617 | 0.942544 | 0.194887 | 0.213856 | 0.605882 | 0.598689 | 0.605882 | 0.598689 | 0.578472 | 0.600708 | 0.674610 | 0.697569 |
| 3 | 281.137740 | 0.781482 | 0.925926 | 0.942544 | 0.194734 | 0.216727 | 0.570588 | 0.598689 | 0.570588 | 0.598689 | 0.562436 | 0.610995 | 0.647160 | 0.695582 |
| 4 | 239.635837 | 0.588920 | 0.913580 | 0.943912 | 0.198708 | 0.230512 | 0.588235 | 0.621639 | 0.588235 | 0.621639 | 0.656645 | 0.612997 | 0.720309 | 0.714563 |
| 5 | 239.223487 | 0.813644 | 0.913580 | 0.949384 | 0.149128 | 0.232911 | 0.526627 | 0.620577 | 0.526627 | 0.620577 | 0.618909 | 0.609463 | 0.692428 | 0.709944 |
| 6 | 235.626651 | 0.894789 | 0.963415 | 0.945205 | 0.213509 | 0.219320 | 0.609467 | 0.615334 | 0.609467 | 0.615334 | 0.590106 | 0.607731 | 0.664725 | 0.695039 |
| 7 | 236.774201 | 1.049012 | 0.963415 | 0.931507 | 0.220772 | 0.223834 | 0.621302 | 0.605505 | 0.621302 | 0.605505 | 0.614538 | 0.625388 | 0.696133 | 0.704632 |
| 8 | 133.821489 | 0.426177 | 0.876543 | 0.938440 | 0.178367 | 0.231735 | 0.556213 | 0.614024 | 0.556213 | 0.614024 | 0.586877 | 0.592855 | 0.658808 | 0.683414 |
| 9 | 132.539648 | 0.396519 | 0.888889 | 0.928865 | 0.263853 | 0.225910 | 0.621302 | 0.608781 | 0.621302 | 0.608781 | 0.647723 | 0.650842 | 0.724506 | 0.730416 |
# Make predictions with the Stacking Classifier on the entire dataset
preds_stacking3_A = cross_val_predict(stclf_pipeline3_A, Xa, Ya, cv=10, verbose=3, n_jobs=-1)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 8.2min remaining: 3.5min
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 10.5min finished
# Plot Confusion Matrix
plot_confusion_matrix(Ya, preds_stacking3_A, target_names=list(cancers[:9]),
title='Confusion Matrix')
# Plot sensitivities
plot_sensitivities(preds_stacking3_A, Ya, title='Sensitivity per Cancer Type')
cSamples = sum((preds_stacking3_A != 9) & (Ya != 9))
ccSamples = sum(preds_stacking3_A != 9)
tot = sum(Ya != 9)
print(f'Cancer samples correctly classified (sensitivity): {cSamples}, out of {tot} ({cSamples/tot:.1%})')
print(f'Total precision on cancer samples: {cSamples/ccSamples:.1%}')
Cancer samples correctly classified (sensitivity): 688, out of 883 (77.9%) Total precision on cancer samples: 92.6%
print(classification_report(Ya, preds_stacking3_A, target_names=cancers[:9]))
precision recall f1-score support
Colorectum 0.40 0.64 0.49 346
Lung 0.12 0.03 0.05 94
Breast 0.17 0.09 0.12 174
Pancreas 0.10 0.04 0.05 82
Ovary 0.38 0.12 0.19 48
Esophagus 0.00 0.00 0.00 41
Liver 0.00 0.00 0.00 38
Stomach 0.00 0.00 0.00 60
Healthy 0.80 0.93 0.86 812
accuracy 0.59 1695
macro avg 0.22 0.21 0.20 1695
weighted avg 0.50 0.59 0.54 1695
# Print performance
performance_stacking3_A = cv_score_summary(cvScores_stacking3_A)
display(performance_stacking3_A)
# Report the cross-validated AUC as mean +/- one standard deviation
auc_mean = performance_stacking3_A.loc['AUC (mean)', 'Scores']
auc_std = performance_stacking3_A.loc['AUC (mean)', 'Std']
print("{:.1%} <= AUC <= {:.1%}".format(auc_mean - auc_std, auc_mean + auc_std))
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9383 | 0.0319 |
| Sensitivity (med) | 0.2042 | 0.0298 |
| Sensitivity weighted (med) | 0.6077 | 0.0298 |
| AUC (med) | 0.6042 | 0.0280 |
| Specificity (mean) | 0.9322 | 0.0319 |
| Sensitivity (mean) | 0.2060 | 0.0298 |
| Sensitivity weighted (mean) | 0.5941 | 0.0298 |
| AUC (mean) | 0.6064 | 0.0280 |
57.8% <= AUC <= 63.4%
Use XGBoost with default hyperparameters as the final estimator in the stacking classifier. Default hyperparameters reduce the risk of overfitting to this particular dataset, so the model is more likely to perform similarly on an independent dataset.
%%time
# Select predictors and target variable
Xa = sh_data[aneuploidy]
Ya = sh_data['Tumor type']
# Split into train and test sets
trainXa, testXa, trainYa, testYa = train_test_split(Xa, Ya, test_size=0.2, random_state=89, stratify=Ya)
# Create a final meta estimator for the stacking classifier
meta_estimator = xgb.XGBClassifier()
# Create Stacking Classifier with four estimators
stclf4_A = ensemble.StackingClassifier(estimators=list(to_stack.items()),
final_estimator=meta_estimator, passthrough=True,
cv=10, n_jobs=-1)
# Create pipeline
stclf_pipeline4_A = Pipeline(steps=[('aneuploidy_pipeline', aneuploidy_pipeline),
('stacking_clf', stclf4_A)], verbose=3)
stclf_pipeline4_A.fit(trainXa, trainYa)
# Cross validate above parameter tuning
cvScores_stacking4_A = crossVal(stclf_pipeline4_A, Xa, Ya, cv_folds=10)
[Pipeline] (step 1 of 2) Processing aneuploidy_pipeline, total= 0.0s
[Pipeline] ...... (step 2 of 2) Processing stacking_clf, total= 1.8min
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 6.9min remaining: 3.0min
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 8.8min finished
Model report
Cross Validated scores
Sensitivity weighted (test): 0.6053
Sensitivity (test): 0.2075
AUC (train): 0.61
AUC (test): 0.6115
CPU times: user 9.06 s, sys: 463 ms, total: 9.53 s
Wall time: 10min 37s
pd.DataFrame(cvScores_stacking4_A)
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 198.973890 | 0.405818 | 0.950617 | 0.952120 | 0.216559 | 0.224543 | 0.617647 | 0.625574 | 0.617647 | 0.625574 | 0.615132 | 0.642800 | 0.705990 | 0.747865 |
| 1 | 196.839574 | 0.570200 | 0.975309 | 0.945280 | 0.222301 | 0.226723 | 0.629412 | 0.617705 | 0.629412 | 0.617705 | 0.617169 | 0.586002 | 0.714430 | 0.691883 |
| 2 | 199.304039 | 0.411393 | 0.962963 | 0.945280 | 0.202607 | 0.222343 | 0.623529 | 0.618361 | 0.623529 | 0.618361 | 0.558065 | 0.592873 | 0.674002 | 0.703856 |
| 3 | 197.422318 | 0.650696 | 0.925926 | 0.942544 | 0.188198 | 0.216232 | 0.564706 | 0.602623 | 0.564706 | 0.602623 | 0.575091 | 0.614095 | 0.668913 | 0.708713 |
| 4 | 203.620958 | 0.455500 | 0.925926 | 0.945280 | 0.208390 | 0.231894 | 0.611765 | 0.626885 | 0.611765 | 0.626885 | 0.655598 | 0.604004 | 0.730401 | 0.718679 |
| 5 | 202.983304 | 0.457844 | 0.925926 | 0.952120 | 0.166746 | 0.230068 | 0.550296 | 0.627130 | 0.550296 | 0.627130 | 0.623799 | 0.608120 | 0.694006 | 0.721719 |
| 6 | 198.895910 | 0.532579 | 0.975610 | 0.949315 | 0.199251 | 0.219124 | 0.603550 | 0.619921 | 0.603550 | 0.619921 | 0.565511 | 0.608931 | 0.661391 | 0.705414 |
| 7 | 200.535090 | 0.586220 | 0.975610 | 0.941096 | 0.203172 | 0.225568 | 0.633136 | 0.614679 | 0.633136 | 0.614679 | 0.623292 | 0.616569 | 0.713591 | 0.708875 |
| 8 | 110.549288 | 0.248029 | 0.901235 | 0.938440 | 0.204803 | 0.224074 | 0.591716 | 0.610747 | 0.591716 | 0.610747 | 0.620882 | 0.591944 | 0.689672 | 0.695865 |
| 9 | 111.035201 | 0.382475 | 0.913580 | 0.939808 | 0.263328 | 0.228674 | 0.627219 | 0.612713 | 0.627219 | 0.612713 | 0.660390 | 0.651222 | 0.738320 | 0.738893 |
# Make predictions with the Stacking Classifier on the entire dataset
preds_stacking4_A = cross_val_predict(stclf_pipeline4_A, Xa, Ya, cv=10, verbose=3, n_jobs=-1)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 6.6min remaining: 2.8min
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 8.5min finished
# Plot Confusion Matrix
plot_confusion_matrix(Ya, preds_stacking4_A, target_names=list(cancers[:9]),
title='Confusion Matrix')
# Plot sensitivities
plot_sensitivities(preds_stacking4_A, Ya, title='Sensitivity per Cancer Type')
cSamples = sum((preds_stacking4_A != 9) & (Ya != 9))
ccSamples = sum(preds_stacking4_A != 9)
tot = sum(Ya != 9)
print(f'Cancer samples correctly classified (sensitivity): {cSamples}, out of {tot} ({cSamples/tot:.1%})')
print(f'Total precision on cancer samples: {cSamples/ccSamples:.1%}')
Cancer samples correctly classified (sensitivity): 683, out of 883 (77.3%) Total precision on cancer samples: 93.7%
print(classification_report(Ya, preds_stacking4_A, target_names=cancers[:9]))
precision recall f1-score support
Colorectum 0.41 0.67 0.51 346
Lung 0.05 0.01 0.02 94
Breast 0.20 0.10 0.14 174
Pancreas 0.14 0.04 0.06 82
Ovary 0.38 0.10 0.16 48
Esophagus 0.00 0.00 0.00 41
Liver 0.00 0.00 0.00 38
Stomach 0.00 0.00 0.00 60
Healthy 0.79 0.94 0.86 812
accuracy 0.61 1695
macro avg 0.22 0.21 0.19 1695
weighted avg 0.51 0.61 0.54 1695
# Print performance
performance_stacking4_A = cv_score_summary(cvScores_stacking4_A)
display(performance_stacking4_A)
# Report the cross-validated AUC as mean +/- one standard deviation
auc_mean = performance_stacking4_A.loc['AUC (mean)', 'Scores']
auc_std = performance_stacking4_A.loc['AUC (mean)', 'Std']
print("{:.1%} <= AUC <= {:.1%}".format(auc_mean - auc_std, auc_mean + auc_std))
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9383 | 0.0267 |
| Sensitivity (med) | 0.2040 | 0.0236 |
| Sensitivity weighted (med) | 0.6147 | 0.0236 |
| AUC (med) | 0.6190 | 0.0332 |
| Specificity (mean) | 0.9433 | 0.0267 |
| Sensitivity (mean) | 0.2075 | 0.0236 |
| Sensitivity weighted (mean) | 0.6053 | 0.0236 |
| AUC (mean) | 0.6115 | 0.0332 |
57.8% <= AUC <= 64.5%
%%time
# Select predictors and target variable
Xa = sh_data[aneuploidy]
Ya = sh_data['Tumor type']
# Split into train and test sets
trainXa, testXa, trainYa, testYa = train_test_split(Xa, Ya, test_size=0.2, random_state=89, stratify=Ya)
# Create a final meta estimator for the stacking classifier
meta_estimator = CatBoostClassifier(learning_rate=0.1,
n_estimators=500,
max_depth=3,
eval_metric="MultiClass",
bootstrap_type="Bernoulli",
silent=True)
# Create Stacking Classifier with four estimators
stclf5_A = ensemble.StackingClassifier(estimators=list(to_stack.items()),
final_estimator=meta_estimator, passthrough=True,
cv=10, n_jobs=-1)
# Create pipeline
stclf_pipeline5_A = Pipeline(steps=[('aneuploidy_pipeline', aneuploidy_pipeline),
('stacking_clf', stclf5_A)], verbose=3)
stclf_pipeline5_A.fit(trainXa, trainYa)
# Cross validate above parameter tuning
cvScores_stacking5_A = crossVal(stclf_pipeline5_A, Xa, Ya, cv_folds=10)
[Pipeline] (step 1 of 2) Processing aneuploidy_pipeline, total= 0.0s
[Pipeline] ...... (step 2 of 2) Processing stacking_clf, total= 2.0min
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 7.9min remaining: 3.4min
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 9.8min finished
Model report
Cross Validated scores
Sensitivity weighted (test): 0.6118
Sensitivity (test): 0.2034
AUC (train): 0.63
AUC (test): 0.6113
CPU times: user 10.8 s, sys: 731 ms, total: 11.5 s
Wall time: 11min 50s
pd.DataFrame(cvScores_stacking5_A)
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 233.692148 | 0.657289 | 0.950617 | 0.958960 | 0.207211 | 0.223503 | 0.605882 | 0.628197 | 0.605882 | 0.628197 | 0.617933 | 0.635190 | 0.725210 | 0.752248 |
| 1 | 233.851258 | 0.500771 | 0.975309 | 0.945280 | 0.209955 | 0.211869 | 0.617647 | 0.600000 | 0.617647 | 0.600000 | 0.625437 | 0.615936 | 0.738992 | 0.729564 |
| 2 | 233.846264 | 0.525336 | 0.950617 | 0.942544 | 0.201049 | 0.215747 | 0.623529 | 0.619016 | 0.623529 | 0.619016 | 0.576369 | 0.642759 | 0.713722 | 0.749124 |
| 3 | 233.789918 | 0.582379 | 0.925926 | 0.949384 | 0.175897 | 0.224646 | 0.576471 | 0.630820 | 0.576471 | 0.630820 | 0.580376 | 0.623244 | 0.705287 | 0.736765 |
| 4 | 227.756153 | 0.619838 | 0.938272 | 0.949384 | 0.193328 | 0.224527 | 0.605882 | 0.631475 | 0.605882 | 0.631475 | 0.607783 | 0.614117 | 0.723512 | 0.738509 |
| 5 | 227.992815 | 0.615002 | 0.925926 | 0.953488 | 0.185607 | 0.227415 | 0.591716 | 0.637615 | 0.591716 | 0.637615 | 0.600468 | 0.621285 | 0.704148 | 0.751950 |
| 6 | 226.141294 | 0.638376 | 0.963415 | 0.947945 | 0.227308 | 0.209393 | 0.644970 | 0.621232 | 0.644970 | 0.621232 | 0.620872 | 0.605291 | 0.751417 | 0.732661 |
| 7 | 229.676904 | 0.670243 | 0.975610 | 0.945205 | 0.209708 | 0.221335 | 0.644970 | 0.626474 | 0.644970 | 0.626474 | 0.607714 | 0.651300 | 0.750006 | 0.753231 |
| 8 | 113.036298 | 0.309040 | 0.913580 | 0.938440 | 0.179940 | 0.228262 | 0.579882 | 0.628440 | 0.579882 | 0.628440 | 0.638992 | 0.607502 | 0.738535 | 0.736259 |
| 9 | 111.738950 | 0.494828 | 0.913580 | 0.943912 | 0.243993 | 0.228964 | 0.627219 | 0.627785 | 0.627219 | 0.627785 | 0.637351 | 0.639562 | 0.759550 | 0.752320 |
# Make predictions with the Stacking Classifier on the entire dataset
preds_stacking5_A = cross_val_predict(stclf_pipeline5_A, Xa, Ya, cv=10, verbose=3, n_jobs=-1)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 7.4min remaining: 3.2min
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 9.3min finished
# Plot Confusion Matrix
plot_confusion_matrix(Ya, preds_stacking5_A, target_names=list(cancers[:9]),
title='Confusion Matrix')
# Plot sensitivities
plot_sensitivities(preds_stacking5_A, Ya, title='Sensitivity per Cancer Type')
cSamples = sum((preds_stacking5_A != 9) & (Ya != 9))
ccSamples = sum(preds_stacking5_A != 9)
tot = sum(Ya != 9)
print(f'Cancer samples correctly classified (sensitivity): {cSamples}, out of {tot} ({cSamples/tot:.1%})')
print(f'Total precision on cancer samples: {cSamples/ccSamples:.1%}')
Cancer samples correctly classified (sensitivity): 683, out of 883 (77.3%) Total precision on cancer samples: 93.7%
print(classification_report(Ya, preds_stacking5_A, target_names=cancers[:9]))
precision recall f1-score support
Colorectum 0.41 0.75 0.53 346
Lung 0.00 0.00 0.00 94
Breast 0.11 0.03 0.05 174
Pancreas 0.00 0.00 0.00 82
Ovary 0.50 0.10 0.17 48
Esophagus 0.00 0.00 0.00 41
Liver 0.00 0.00 0.00 38
Stomach 0.00 0.00 0.00 60
Healthy 0.79 0.94 0.86 812
accuracy 0.61 1695
macro avg 0.20 0.20 0.18 1695
weighted avg 0.49 0.61 0.53 1695
Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
# Print performance
performance_stacking5_A = cv_score_summary(cvScores_stacking5_A)
display(performance_stacking5_A)
# Report the cross-validated AUC as mean +/- one standard deviation
auc_mean = performance_stacking5_A.loc['AUC (mean)', 'Scores']
auc_std = performance_stacking5_A.loc['AUC (mean)', 'Std']
print("{:.1%} <= AUC <= {:.1%}".format(auc_mean - auc_std, auc_mean + auc_std))
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9444 | 0.0223 |
| Sensitivity (med) | 0.2041 | 0.0201 |
| Sensitivity weighted (med) | 0.6118 | 0.0201 |
| AUC (med) | 0.6129 | 0.0202 |
| Specificity (mean) | 0.9433 | 0.0223 |
| Sensitivity (mean) | 0.2034 | 0.0201 |
| Sensitivity weighted (mean) | 0.6118 | 0.0201 |
| AUC (mean) | 0.6113 | 0.0202 |
59.1% <= AUC <= 63.1%
Combining estimators through stacking has, in general, decreased performance compared with the single-model alternatives. The only combination that matched the earlier result was the one with Random Forest as the meta estimator: roughly 80% of all cancer samples were correctly classified (79.6%) while precision remained at about 94% (93.7%).
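For an overview, the mean cross-validated scores reported for the five stacking runs can be collected into a single frame (values copied from the summaries above; the row labels are shorthand for the respective meta estimators):

```python
import pandas as pd

# Mean cross-validated scores from the five stacking experiments above
results = pd.DataFrame({
    'LogReg meta':       {'AUC': 0.6248, 'Sensitivity': 0.2115, 'Specificity': 0.9544},
    'RandomForest meta': {'AUC': 0.6194, 'Sensitivity': 0.2053, 'Specificity': 0.9408},
    'XGBoost (tuned)':   {'AUC': 0.6064, 'Sensitivity': 0.2060, 'Specificity': 0.9322},
    'XGBoost (default)': {'AUC': 0.6115, 'Sensitivity': 0.2075, 'Specificity': 0.9433},
    'CatBoost meta':     {'AUC': 0.6113, 'Sensitivity': 0.2034, 'Specificity': 0.9433},
}).T

# Rank the meta estimators by mean AUC
print(results.sort_values('AUC', ascending=False))
```

The differences are small and well within one standard deviation of each other, which supports the conclusion that the choice of meta estimator does not change the picture on aneuploidy features alone.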
name = 'XGBoost'
# Select predictors and target variable
X = sh_data[numerical_features]
Y = sh_data['Tumor type']
# Split into train and test sets
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2, random_state=89, stratify=Y)
# Plot Confusion Matrix
plot_confusion_matrix(Y, predictions[name], target_names=list(cancers[:9]),
title='Confusion Matrix')
# Plot sensitivities
plot_sensitivities(predictions[name], Y, title='Sensitivity per Cancer Type')
# Print cross-validation scores
display(pd.DataFrame(crossVal_scores[name]))
# Print classification report
print(classification_report(Y, predictions[name], target_names=cancers[:9]))
# Print the fraction of cancer/healthy samples classified
cSamples = sum((predictions[name] != 9) & (Y != 9))
ccSamples = sum(predictions[name] != 9)
tot = sum(Y != 9)
print(f'Cancer samples correctly classified (sensitivity): {cSamples}, out of {tot} ({cSamples/tot:.1%})')
print(f'Total precision on cancer samples: {cSamples/ccSamples:.1%}')
# Print performance
performance = cv_score_summary(crossVal_scores[name])
display(performance)
# Report the cross-validated AUC as mean +/- one standard deviation
auc_mean = performance.loc['AUC (mean)', 'Scores']
auc_std = performance.loc['AUC (mean)', 'Std']
print("{:.1%} <= AUC <= {:.1%}".format(auc_mean - auc_std, auc_mean + auc_std))
| fit_time | score_time | test_specificity | train_specificity | test_sensitivity | train_sensitivity | test_sensitivity_w | train_sensitivity_w | test_accuracy | train_accuracy | test_roc_auc | train_roc_auc | test_roc_auc_w | train_roc_auc_w | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.826599 | 0.256482 | 1.000000 | 1.0 | 0.486905 | 1.000000 | 0.794118 | 1.000000 | 0.794118 | 1.000000 | 0.875890 | 1.0 | 0.931837 | 1.0 |
| 1 | 2.803414 | 0.248255 | 0.975309 | 1.0 | 0.535396 | 0.999643 | 0.800000 | 0.999344 | 0.800000 | 0.999344 | 0.903352 | 1.0 | 0.936920 | 1.0 |
| 2 | 2.799333 | 0.295478 | 1.000000 | 1.0 | 0.496335 | 1.000000 | 0.758824 | 1.000000 | 0.758824 | 1.000000 | 0.848834 | 1.0 | 0.914702 | 1.0 |
| 3 | 2.814667 | 0.365639 | 0.962963 | 1.0 | 0.495208 | 1.000000 | 0.758824 | 1.000000 | 0.758824 | 1.000000 | 0.881525 | 1.0 | 0.933310 | 1.0 |
| 4 | 2.851409 | 0.214412 | 0.987654 | 1.0 | 0.493267 | 1.000000 | 0.758824 | 1.000000 | 0.758824 | 1.000000 | 0.884278 | 1.0 | 0.929192 | 1.0 |
| 5 | 2.808505 | 0.241189 | 0.987654 | 1.0 | 0.481495 | 0.999643 | 0.781065 | 0.999345 | 0.781065 | 0.999345 | 0.885955 | 1.0 | 0.938447 | 1.0 |
| 6 | 2.849863 | 0.259715 | 0.975610 | 1.0 | 0.504044 | 1.000000 | 0.769231 | 1.000000 | 0.769231 | 1.000000 | 0.899207 | 1.0 | 0.943568 | 1.0 |
| 7 | 2.866695 | 0.289637 | 1.000000 | 1.0 | 0.513072 | 1.000000 | 0.804734 | 1.000000 | 0.804734 | 1.000000 | 0.904886 | 1.0 | 0.941256 | 1.0 |
| 8 | 1.839791 | 0.127954 | 1.000000 | 1.0 | 0.537364 | 1.000000 | 0.804734 | 1.000000 | 0.804734 | 1.000000 | 0.903801 | 1.0 | 0.948841 | 1.0 |
| 9 | 1.714739 | 0.159592 | 0.987654 | 1.0 | 0.428331 | 1.000000 | 0.775148 | 1.000000 | 0.775148 | 1.000000 | 0.893021 | 1.0 | 0.944882 | 1.0 |
precision recall f1-score support
Colorectum 0.61 0.85 0.71 346
Lung 0.52 0.34 0.41 94
Breast 0.53 0.53 0.53 174
Pancreas 0.74 0.59 0.65 82
Ovary 0.81 0.71 0.76 48
Esophagus 0.22 0.05 0.08 41
Liver 0.55 0.32 0.40 38
Stomach 0.30 0.10 0.15 60
Healthy 0.98 0.99 0.98 812
accuracy 0.78 1695
macro avg 0.58 0.50 0.52 1695
weighted avg 0.76 0.78 0.76 1695
Cancer samples correctly classified (sensitivity): 866, out of 883 (98.1%)
Total precision on cancer samples: 98.9%
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9877 | 0.0123 |
| Sensitivity (med) | 0.4958 | 0.0291 |
| Sensitivity weighted (med) | 0.7781 | 0.0291 |
| AUC (med) | 0.8895 | 0.0163 |
| Specificity (mean) | 0.9877 | 0.0123 |
| Sensitivity (mean) | 0.4971 | 0.0291 |
| Sensitivity weighted (mean) | 0.7805 | 0.0291 |
| AUC (mean) | 0.8881 | 0.0163 |
87.2% <= AUC <= 90.4%
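`cv_score_summary` is a helper defined earlier in the notebook. A minimal stand-in that aggregates `cross_validate` test scores the same way (median and mean with standard deviation per metric) might look like the following; the function name and metric list are illustrative:

```python
import numpy as np
import pandas as pd

def cv_summary(cv_results, metrics=('roc_auc',)):
    """Sketch of a cv_score_summary-style helper: for each metric, report
    the median and mean of the test-fold scores together with their std."""
    rows = {}
    for m in metrics:
        scores = np.asarray(cv_results[f'test_{m}'], dtype=float)
        rows[f'{m} (med)'] = (np.median(scores), scores.std())
        rows[f'{m} (mean)'] = (scores.mean(), scores.std())
    return pd.DataFrame(rows, index=['Scores', 'Std']).T

# Usage with a fake cross_validate result dict:
df = cv_summary({'test_roc_auc': [0.85, 0.90, 0.95]})
print(df)
```

The mean ± std interval printed above (e.g. `87.2% <= AUC <= 90.4%`) then follows directly from the `Scores` and `Std` columns of such a summary.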
# Transform the train and test set in the same way as in the pipeline
pt = PercentileTransformer()
sc = StandardScaler()
pt.fit(trainX, trainY)
sc.fit(trainX)
testX = pt.transform(testX)
testX = sc.transform(testX)
# Create dataframe so feature names are shown
testX = pd.DataFrame(testX, columns=numerical_features)
# Plot Shap values
shap_values = shap.TreeExplainer(best_models[name],
feature_perturbation="tree_path_dependent").shap_values(testX)[1]
shap.summary_plot(shap_values, testX)
Setting feature_perturbation = "tree_path_dependent" because no background data was given.
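`PercentileTransformer` is a custom transformer defined earlier in the notebook. As a hypothetical sketch of the idea it embodies (mapping each feature value to its percentile rank within the training data, which makes heavy-tailed biomarker concentrations comparable), such a transformer could be written as:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class PercentileRankTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the notebook's PercentileTransformer:
    replaces each value with the fraction of training values <= it."""

    def fit(self, X, y=None):
        # Store each training column in sorted order for fast lookup
        self.reference_ = np.sort(np.asarray(X, dtype=float), axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        n = self.reference_.shape[0]
        out = np.empty_like(X)
        for j in range(X.shape[1]):
            # searchsorted on the sorted column gives the count of
            # training values <= each test value
            out[:, j] = np.searchsorted(self.reference_[:, j], X[:, j],
                                        side='right') / n
        return out

pt = PercentileRankTransformer().fit([[1.0], [2.0], [3.0], [4.0]])
print(pt.transform([[2.0], [4.0]]))
```

Fitting on the training split and transforming the test split with the same fitted object, as done above, avoids leaking test-set statistics into the features.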
Aneuploidy only
Results are considerably lower when only the single Aneuploidy feature is used for modelling. At most, 80% of the cancer samples are correctly classified, with a precision of 94%, when Random Forest is used as meta estimator on the KNN, Gradient Boosting, CatBoost, LightGBM and XGBoost predictions.
# Plot Confusion Matrix
plot_confusion_matrix(Ya, preds_stacking2_A, target_names=[i for i in cancers[:9]],
title='Confusion Matrix')
# Plot sensitivities
plot_sensitivities(preds_stacking2_A, Ya, title='Sensitivity per Cancer Type')
cSamples = sum((preds_stacking2_A != 9) & (Ya != 9))
ccSamples = sum(preds_stacking2_A != 9)
tot = sum(Ya != 9)
print(f'Cancer samples correctly classified (sensitivity): {cSamples}, out of {tot} ({cSamples/tot:.1%})')
print(f'Total precision on cancer samples: {cSamples/ccSamples:.1%}')
Cancer samples correctly classified (sensitivity): 703, out of 883 (79.6%) Total precision on cancer samples: 93.7%
print(classification_report(Ya, preds_stacking2_A, target_names=cancers[:9]))
precision recall f1-score support
Colorectum 0.40 0.85 0.54 346
Lung 0.00 0.00 0.00 94
Breast 0.25 0.01 0.01 174
Pancreas 0.00 0.00 0.00 82
Ovary 1.00 0.06 0.12 48
Esophagus 0.00 0.00 0.00 41
Liver 0.00 0.00 0.00 38
Stomach 0.00 0.00 0.00 60
Healthy 0.81 0.94 0.87 812
accuracy 0.63 1695
macro avg 0.27 0.21 0.17 1695
weighted avg 0.52 0.63 0.53 1695
Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
# Print performance
performance_stacking2_A = cv_score_summary(cvScores_stacking2_A)
display(performance_stacking2_A)
# Calculate AUC with standard deviations
med = performance_stacking2_A.loc['AUC (mean)', 'Scores']
std = performance_stacking2_A.loc['AUC (mean)', 'Std']
print("{:.1%} <= AUC <= {:.1%}".format(med-std, med+std))
| Scores | Std | |
|---|---|---|
| Specificity (med) | 0.9444 | 0.0227 |
| Sensitivity (med) | 0.2022 | 0.0141 |
| Sensitivity weighted (med) | 0.6294 | 0.0141 |
| AUC (med) | 0.6188 | 0.0309 |
| Specificity (mean) | 0.9408 | 0.0227 |
| Sensitivity (mean) | 0.2053 | 0.0141 |
| Sensitivity weighted (mean) | 0.6254 | 0.0141 |
| AUC (mean) | 0.6194 | 0.0309 |
58.8% <= AUC <= 65.0%
Two different sets of models have been optimized: one on the full feature set (Aneuploidy, Mutation, AFP, CA-125, CA19-9, CEA, HGF, OPN, Prolactin and TIMP-1) and one on the Aneuploidy feature only. With the entire 10-feature set the results are considerably better than with the single Aneuploidy feature: 98% of the cancer samples are correctly classified (compared with 80% for Aneuploidy only) with 99% precision (versus 94%). Specificity, the fraction of correctly classified healthy samples, is 99%, with a corresponding precision of 98%. It should be noted that the VotingClassifier also performed well, achieving slightly higher specificity at above 99%.
The results on the entire feature set are promising, giving a confident indication of whether a patient has cancer or not.